Merging results from model.predict() with original pandas DataFrame? - python

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
But that raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.

your y_hats length will only be the length on the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by examining the accuracy of your model on the X_test predictions compared to the X_test true values), you should rerun the predict on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
EDIT per your comment, here is an updated result the returns the dataset with the prediction appended where they were in the test datset
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)

I have the same problem (almost)
I fixed it this way
...
.
.
.
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)

You can create a y_hat dataframe copying indices from X_test then merge with the original data.
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)
Note, left join will include train data rows. Omitting 'how' parameter will result in just test data.

Try this:
y_hats2 = model.predict(X)
df[['y_hats']] = y_hats2

You can probably make a new dataframe and add to it the test data along with the predicted values:
data['y_hats'] = y_hats
data.to_csv('data1.csv')

predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True,
right_index=True)

This worked well for me. It maintains the indexing positions.
pred_prob = model.predict(X_test) # calculate prediction probabilities
pred_class = np.where(pred_prob >0.5, "Yes", "No") #for binary(Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis =1)

Here is a solution that worked for me:
It consists of building, for each of your folds/iterations, one dataframe which includes observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').
You then concatenate the DataFrames that you generated (through 5 folds of test data in my case) and paste them back into your original dataset.
I hope this helps!
FoldNr = 0
for train_index, test_index in skf.split(X, y):
FoldNr = FoldNr + 1
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
# [...] your model
# performance is measured on test set
y_true, y_pred = y_test, clf.predict(X_test)
# Save predicted values for each test set
a = pd.DataFrame(y_true).reset_index()
b = pd.Series(y_pred, name = 'y_pred')
globals()['ObsPred_df' + str(FoldNr)] = a.join(b)
globals()['ObsPred_df' + str(FoldNr)].set_index('SubjID', inplace=True)
# Create dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat([ObsPred_df1, ObsPred_df2, ObsPred_df3, ObsPred_df4, ObsPred_df5])
original_df['y_pred'] = ObsPred_Concat['y_pred']

First you need to convert y_val or y_test data into the DataFrame.
compare_df = pd.DataFrame(y_val)
then just create a new column with predicted data.
compare_df['predicted_res'] = y_pred_val
After that, you can easily filter the data that shows you which data is matching with original prediction based on a simple condition.
test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res'] ]

you can also use
y_hats = model.predict(X)
df['y_hats'] = y_hats.reset_index()['name of the target column']

Related

Adding the dataframe name when looping over dataframes

I wrote the following code to loop the same function over different dataframes (named "Drought", "Flashflood",etc). I was happy to see that it worked, but I'm trying to determine how to get the names of the dataframes to append with the train and test scores. Can someone please guide me on what I'm missing here? If I do it like I currently have it, all the names post in each row at the bottom, but I only want the corresponding one. Similarly, the output I get appends each new array together, but my understanding was that append would just add a new item to a list?
For example I'm getting this as a result:
[(0.11995478823013683, -0.07264567664161303), (0.11998113643282327, -0.034458152253100005)]
But I would expect this:
[("Drought",0.11995478823013683, -0.07264567664161303)]
[("Flashflood",0.11998113643282327, -0.034458152253100005)]
Here's the code:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
df_list = [Drought, Flashflood, Flood, Gale]
names = ['Drought','Flashflood','Flood','Gale']
knn_r_acc = []
rmse_val = [] #to store rmse values for different dataframes
for df in df_list:
X = df[['Year.Month','IDH.M_2000','Population','IDH.M_2010']]
y = df['Deceased'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#Scaling
scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = scaler.fit_transform(X_train)
x_train = pd.DataFrame(x_train_scaled)
x_test_scaled = scaler.fit_transform(X_test)
x_test = pd.DataFrame(x_test_scaled)
model = neighbors.KNeighborsRegressor(n_neighbors = 3, weights = 'uniform')
model.fit(x_train, y_train) #fit the model
pred=model.predict(x_test) #make prediction on test set
error = sqrt(mean_squared_error(y_test,pred)) #calculate rmse
rmse_val.append(error) #store rmse values
#print('Model= ' , df, 'is:', error)
knn.fit(X_train,y_train)
test_score = knn.score(X_test,y_test)
train_score = knn.score(X_train,y_train)
#print(test_score)
#print(train_score)
knn_r_acc.append((names,train_score, test_score))
print(knn_r_acc)
In your case, names is actually a whole list / array.
You can implement it using an index variable. So before the loop starts, add:
name_index = 0
and inside the loop, append like so:
knn_r_acc.append((names[name_index], train_score, test_score))
name_index += 1

Test train data split - Machine learning

I am trying to do some test train split (90% and 10%) and used below query
X_train, X_test, y_train, y_test = train_test_split(pdf.drop(columns = list(set(cols_not_used).union(set(['RANK'])))) , pdf['RANK'], random_state = 13, train_size = 0.9)
But in my dataframe I have "date" column and would like to split train and test based on date.
i.e., test data as latest 3 months information and the rest as train dataset.
Please let me know how that can be done.
Not sure if this is the best way to proceed, but you can sort the data by time.
df = df.sort_values(by="date")
After sorting it by time you can create your train test set:
import numpy as np
train_set, test_set = np.split(data, [int(0.9 *len(df))])
The next step is to simply split between your predictory and explanatory variables:
xtrain = train_set.drop('Survived', axis=1)
ytrain = train_set['Survived']
xtest = test_set.drop('Survived', axis=1)
ytest = test_set['Survived']

How to preserve the unique IDs of rows when doing machine learning?

I have a dataset X that contains an ID column, some other features, and a target column. I am doing a classification task, and after doing the classification on the test set, I want to see which ID belongs to which class.
So, I do the following:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv('Dataset.csv')
X = df.drop(['ID', 'Target_Feature'], axis=1)
Y = df[['ID', 'Target_Feature']]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)
pol_ids = Y_test.ID ### Save the IDs of the test set to append to a new dataframe later
Y_train = Y_train.drop(['ID'], axis=1).values
Y_test = Y_test.drop(['ID'], axis=1).values
logReg = LogisticRegression()
logReg.fit(X_train, Y_train)
logReg.score(X_train, Y_train)
>>> 0.6300364252164744
predictions = logReg.predict(X_test)
predictions
>>> array([1, 0, 0, ..., 0, 1, 0], dtype=int64)
Then I do the following to construct a new dataframe with the ID column and the predictions:
y_pred = logReg.predict_proba(X_test)
df1 = pd.DataFrame(pol_ids)
df1 = df1.reset_index(drop=True)
df2 = pd.DataFrame(y_pred[:,1])
df1['Predictions']=df2
df1['Name']=df.loc[df1.index]['Name'].values ### This is one of the columns in the original dataframe
But, when I check the row in the original dataframe, df, for a given ID, its name is not the same in the new dataframe, df1. This means, most likely, that the IDs have not been correctly copied to the new dataframe.
So, how can I do that?
Check your last line
df1['Name']=df.loc[df1.index]['Name'].values
After reset_index , the index is change, so change to
df1['Name']=df.loc[pol_ids.index]['Name'].values

python pandas split to testing and training data [duplicate]

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.
Thanks!
Scikit Learn's train_test_split is a good one. It will split both numpy arrays and dataframes.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
I would just use numpy's randn:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to see this has worked:
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
Pandas random sample will also work
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
For the same random_state value you will always get the same exact data in the training and test set. This brings in some level of repeatability while also randomly separating training and test data.
I would use scikit-learn's own training_test_split, and generate it from the index
from sklearn.model_selection import train_test_split
y = df.pop('output')
X = df
X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
And if you want to split x from y
X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)
And if you want to split the whole df
X, y = df[list_of_x_cols], df[y_col]
There are many ways to create a train/test and even validation samples.
Case 1: classic way train_test_split without any options:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
Case 2: case of a very small datasets (<500 rows): in order to get results for all your lines with this cross-validation. At the end, you will have one prediction for each line of your available training set.
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 3a: Unbalanced datasets for classification purpose. Following the case 1, here is the equivalent solution:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
Case 3b: Unbalanced datasets for classification purpose. Following the case 2, here is the equivalent solution:
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 4: you need to create a train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).
from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.6)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y, test_size=0.5)
You can use below code to create test and train samples :
from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)
Test size can vary depending on the percentage of data you want to put in your test and train dataset.
There are many valid answers. Adding one more to the bunch.
from sklearn.cross_validation import train_test_split
#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~df_model.index.isin(X_train.index)]
You may also consider stratified division into training and testing set. Startified division also generates training and testing set randomly but in such a way that original class proportions are preserved. This makes training and testing sets better reflect the properties of the original dataset.
import numpy as np
def get_train_test_inds(y,train_proportion=0.7):
'''Generates indices, making random stratified split into training set and testing sets
with proportions train_proportion and (1-train_proportion) of initial sample.
y is any iterable indicating classes of each observation in the sample.
Initial proportions of classes inside training and
testing sets are preserved (stratified sampling).
'''
y=np.array(y)
train_inds = np.zeros(len(y),dtype=bool)
test_inds = np.zeros(len(y),dtype=bool)
values = np.unique(y)
for value in values:
value_inds = np.nonzero(y==value)[0]
np.random.shuffle(value_inds)
n = int(train_proportion*len(value_inds))
train_inds[value_inds[:n]]=True
test_inds[value_inds[n:]]=True
return train_inds,test_inds
df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
You can use ~ (tilde operator) to exclude the rows sampled using df.sample(), letting pandas alone handle sampling and filtering of indexes, to obtain two sets.
train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
If you need to split your data with respect to the lables column in your data set you can use this:
def split_to_train_test(df, label_column, train_frac=0.8):
train_df, test_df = pd.DataFrame(), pd.DataFrame()
labels = df[label_column].unique()
for lbl in labels:
lbl_df = df[df[label_column] == lbl]
lbl_train_df = lbl_df.sample(frac=train_frac)
lbl_test_df = lbl_df.drop(lbl_train_df.index)
print '\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df))
train_df = train_df.append(lbl_train_df)
test_df = test_df.append(lbl_test_df)
return train_df, test_df
and use it:
train, test = split_to_train_test(data, 'class', 0.7)
you can also pass random_state if you want to control the split randomness or use some global random seed.
To split into more than two classes such as train, test, and validation, one can do:
probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validatoin_mask = probs >= 0.85
df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validatoin_mask]
This will put approximately 70% of data in training, 15% in test, and 15% in validation.
shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]
Just select range row from df like this
row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
import pandas as pd
from sklearn.model_selection import train_test_split
datafile_name = 'path_to_data_file'
data = pd.read_csv(datafile_name)
target_attribute = data['column_name']
X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.8)
This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).
def make_sets(data_df, test_portion):
import random as rnd
tot_ix = range(len(data_df))
test_ix = sort(rnd.sample(tot_ix, int(test_portion * len(data_df))))
train_ix = list(set(tot_ix) ^ set(test_ix))
test_df = data_df.ix[test_ix]
train_df = data_df.ix[train_ix]
return train_df, test_df
train_df, test_df = make_sets(data_df, 0.2)
test_df.head()
There are many great answers above so I just wanna add one more example in the case that you want to specify the exact number of samples for the train and test sets by using just the numpy library.
# set the random seed for the reproducibility
np.random.seed(17)
# e.g. number of samples for the training set is 1000
n_train = 1000
# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)
# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]
train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]
test_data = data_df.iloc[test_ids]
test_labels = data_df.iloc[test_ids]
if you want to split it to train, test and validation set you can use this function:
from sklearn.model_selection import train_test_split
import pandas as pd
def train_test_val_split(df, test_size=0.15, val_size=0.45):
temp, test = train_test_split(df, test_size=test_size)
total_items_count = len(df.index)
val_length = total_items_count * val_size
new_val_propotion = val_length / len(temp.index)
train, val = train_test_split(temp, test_size=new_val_propotion)
return train, test, val
If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:
def split_data(df, train_perc = 0.8):
df['train'] = np.random.rand(len(df)) < train_perc
train = df[df.train == 1]
test = df[df.train == 0]
split_data ={'train': train, 'test': test}
return split_data
I think you also need to a get a copy not a slice of dataframe if you wanna add columns later.
msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
You can make use of df.as_matrix() function and create Numpy-array and pass it.
Y = df.pop()
X = df.as_matrix()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.test(x_test)
A bit more elegant to my taste is to create a random column and then split by it, this way we can get a split that will suit our needs and will be random.
def split_df(df, p=[0.8, 0.2]):
import numpy as np
df["rand"]=np.random.choice(len(p), len(df), p=p)
r = [df[df["rand"]==val] for val in df["rand"].unique()]
return r
you need to convert pandas dataframe into numpy array and then convert numpy array back to dataframe
import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)
In my case, I wanted to split a data frame in Train, test and dev with a specific number. Here I am sharing my solution
First, assign a unique id to a dataframe (if already not exist)
import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]
Here are my split numbers:
train = 120765
test = 4134
dev = 2816
The split function
def df_split(df, n):
first = df.sample(n)
second = df[~df.id.isin(list(first['id']))]
first.reset_index(drop=True, inplace = True)
second.reset_index(drop=True, inplace = True)
return first, second
Now splitting into train, test, dev
train, test = df_split(df, 120765)
test, dev = df_split(test, 4134)
The sample method selects a part of data, you can shuffle the data first by passing a seed value.
train = df.sample(frac=0.8, random_state=42)
For test set you can drop the rows through indexes of train DF and then reset the index of new DF.
test = df.drop(train_data.index).reset_index(drop=True)
How about this?
df is my dataframe
total_size=len(df)
train_size=math.floor(0.66*total_size) (2/3 part of my dataset)
#training dataset
train=df.head(train_size)
#test dataset
test=df.tail(len(df) -train_size)
I would use K-fold cross validation.
It's been proven to give much better results than the train_test_split Here's an article on how to apply it with sklearn from the documentation itself: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
Split df into train, validate, test. Given a df of augmented data, select only the dependent and independent columns. Assign 10% of most recent rows (using 'dates' column) to test_df. Randomly assign 10% of remaining rows to validate_df with rest being assigned to train_df. Do not reindex. Check that all rows are uniquely assigned. Use only native python and pandas libs.
Method 1: Split rows into train, validate, test dataframes.
train_df = augmented_df[dependent_and_independent_columns]
test_df = train_df.sort_values('dates').tail(int(len(augmented_df)*0.1)) # select latest 10% of dates for test data
train_df = train_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_df.sample(frac=0.1) # randomly assign 10%
train_df = train_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
Method 2: Split rows when validate must be subset of train (fastai)
train_validate_test_df = augmented_df[dependent_and_independent_columns]
test_df = train_validate_test_df.loc[augmented_df.sort_values('dates').tail(int(len(augmented_df)*0.1)).index] # select latest 10% of dates for test data
train_validate_df = train_validate_test_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_validate_df.sample(frac=validate_ratio) # assign 10% to validate_df
train_df = train_validate_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
# fastai example usage
dls = fastai.tabular.all.TabularDataLoaders.from_df(
train_validate_df, valid_idx=train_validate_df.index.get_indexer_for(validate_df.index))
That's what I do:
train_dataset = dataset.sample(frac=0.80, random_state=200)
val_dataset = dataset.drop(train_dataset.index).sample(frac=1.00, random_state=200, ignore_index = True).copy()
train_dataset = train_dataset.sample(frac=1.00, random_state=200, ignore_index = True).copy()
del dataset

How do I create test and train samples from one dataframe with pandas?

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.
Thanks!
Scikit Learn's train_test_split is a good one. It will split both numpy arrays and dataframes.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
I would just use numpy's randn:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to see this has worked:
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
Pandas random sample will also work
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
For the same random_state value you will always get the same exact data in the training and test set. This brings in some level of repeatability while also randomly separating training and test data.
I would use scikit-learn's own training_test_split, and generate it from the index
from sklearn.model_selection import train_test_split
y = df.pop('output')
X = df
X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
And if you want to split x from y
X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)
And if you want to split the whole df
X, y = df[list_of_x_cols], df[y_col]
There are many ways to create a train/test and even validation samples.
Case 1: classic way train_test_split without any options:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
Case 2: case of a very small datasets (<500 rows): in order to get results for all your lines with this cross-validation. At the end, you will have one prediction for each line of your available training set.
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 3a: Unbalanced datasets for classification purpose. Following the case 1, here is the equivalent solution:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
Case 3b: Unbalanced datasets for classification purpose. Following the case 2, here is the equivalent solution:
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 4: you need to create a train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).
from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.6)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y, test_size=0.5)
You can use below code to create test and train samples :
from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)
Test size can vary depending on the percentage of data you want to put in your test and train dataset.
There are many valid answers. Adding one more to the bunch.
from sklearn.cross_validation import train_test_split
#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~df_model.index.isin(X_train.index)]
You may also consider stratified division into training and testing set. Startified division also generates training and testing set randomly but in such a way that original class proportions are preserved. This makes training and testing sets better reflect the properties of the original dataset.
import numpy as np
def get_train_test_inds(y,train_proportion=0.7):
'''Generates indices, making random stratified split into training set and testing sets
with proportions train_proportion and (1-train_proportion) of initial sample.
y is any iterable indicating classes of each observation in the sample.
Initial proportions of classes inside training and
testing sets are preserved (stratified sampling).
'''
y=np.array(y)
train_inds = np.zeros(len(y),dtype=bool)
test_inds = np.zeros(len(y),dtype=bool)
values = np.unique(y)
for value in values:
value_inds = np.nonzero(y==value)[0]
np.random.shuffle(value_inds)
n = int(train_proportion*len(value_inds))
train_inds[value_inds[:n]]=True
test_inds[value_inds[n:]]=True
return train_inds,test_inds
df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
You can use ~ (tilde operator) to exclude the rows sampled using df.sample(), letting pandas alone handle sampling and filtering of indexes, to obtain two sets.
train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
If you need to split your data with respect to the lables column in your data set you can use this:
def split_to_train_test(df, label_column, train_frac=0.8):
train_df, test_df = pd.DataFrame(), pd.DataFrame()
labels = df[label_column].unique()
for lbl in labels:
lbl_df = df[df[label_column] == lbl]
lbl_train_df = lbl_df.sample(frac=train_frac)
lbl_test_df = lbl_df.drop(lbl_train_df.index)
print '\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df))
train_df = train_df.append(lbl_train_df)
test_df = test_df.append(lbl_test_df)
return train_df, test_df
and use it:
train, test = split_to_train_test(data, 'class', 0.7)
you can also pass random_state if you want to control the split randomness or use some global random seed.
To split into more than two classes such as train, test, and validation, one can do:
probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validatoin_mask = probs >= 0.85
df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validatoin_mask]
This will put approximately 70% of data in training, 15% in test, and 15% in validation.
shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]
Just select range row from df like this
row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
import pandas as pd
from sklearn.model_selection import train_test_split
datafile_name = 'path_to_data_file'
data = pd.read_csv(datafile_name)
target_attribute = data['column_name']
X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.8)
This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).
def make_sets(data_df, test_portion):
import random as rnd
tot_ix = range(len(data_df))
test_ix = sort(rnd.sample(tot_ix, int(test_portion * len(data_df))))
train_ix = list(set(tot_ix) ^ set(test_ix))
test_df = data_df.ix[test_ix]
train_df = data_df.ix[train_ix]
return train_df, test_df
train_df, test_df = make_sets(data_df, 0.2)
test_df.head()
There are many great answers above so I just wanna add one more example in the case that you want to specify the exact number of samples for the train and test sets by using just the numpy library.
# set the random seed for the reproducibility
np.random.seed(17)
# e.g. number of samples for the training set is 1000
n_train = 1000
# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)
# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]
train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]
test_data = data_df.iloc[test_ids]
test_labels = data_df.iloc[test_ids]
if you want to split it to train, test and validation set you can use this function:
from sklearn.model_selection import train_test_split
import pandas as pd
def train_test_val_split(df, test_size=0.15, val_size=0.45):
temp, test = train_test_split(df, test_size=test_size)
total_items_count = len(df.index)
val_length = total_items_count * val_size
new_val_propotion = val_length / len(temp.index)
train, val = train_test_split(temp, test_size=new_val_propotion)
return train, test, val
If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:
def split_data(df, train_perc = 0.8):
df['train'] = np.random.rand(len(df)) < train_perc
train = df[df.train == 1]
test = df[df.train == 0]
split_data ={'train': train, 'test': test}
return split_data
I think you also need to a get a copy not a slice of dataframe if you wanna add columns later.
msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
You can make use of df.as_matrix() function and create Numpy-array and pass it.
Y = df.pop()
X = df.as_matrix()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.test(x_test)
A bit more elegant to my taste is to create a random column and then split by it, this way we can get a split that will suit our needs and will be random.
def split_df(df, p=[0.8, 0.2]):
import numpy as np
df["rand"]=np.random.choice(len(p), len(df), p=p)
r = [df[df["rand"]==val] for val in df["rand"].unique()]
return r
you need to convert pandas dataframe into numpy array and then convert numpy array back to dataframe
import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)
In my case, I wanted to split a data frame in Train, test and dev with a specific number. Here I am sharing my solution
First, assign a unique id to a dataframe (if already not exist)
import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]
Here are my split numbers:
train = 120765
test = 4134
dev = 2816
The split function
def df_split(df, n):
first = df.sample(n)
second = df[~df.id.isin(list(first['id']))]
first.reset_index(drop=True, inplace = True)
second.reset_index(drop=True, inplace = True)
return first, second
Now splitting into train, test, dev
train, test = df_split(df, 120765)
test, dev = df_split(test, 4134)
The sample method selects a part of data, you can shuffle the data first by passing a seed value.
train = df.sample(frac=0.8, random_state=42)
For test set you can drop the rows through indexes of train DF and then reset the index of new DF.
test = df.drop(train_data.index).reset_index(drop=True)
How about this?
df is my dataframe
total_size=len(df)
train_size=math.floor(0.66*total_size) (2/3 part of my dataset)
#training dataset
train=df.head(train_size)
#test dataset
test=df.tail(len(df) -train_size)
I would use K-fold cross validation.
It's been proven to give much better results than the train_test_split Here's an article on how to apply it with sklearn from the documentation itself: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
Split df into train, validate, test. Given a df of augmented data, select only the dependent and independent columns. Assign 10% of most recent rows (using 'dates' column) to test_df. Randomly assign 10% of remaining rows to validate_df with rest being assigned to train_df. Do not reindex. Check that all rows are uniquely assigned. Use only native python and pandas libs.
Method 1: Split rows into train, validate, test dataframes.
train_df = augmented_df[dependent_and_independent_columns]
test_df = train_df.sort_values('dates').tail(int(len(augmented_df)*0.1)) # select latest 10% of dates for test data
train_df = train_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_df.sample(frac=0.1) # randomly assign 10%
train_df = train_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
Method 2: Split rows when validate must be subset of train (fastai)
train_validate_test_df = augmented_df[dependent_and_independent_columns]
test_df = train_validate_test_df.loc[augmented_df.sort_values('dates').tail(int(len(augmented_df)*0.1)).index] # select latest 10% of dates for test data
train_validate_df = train_validate_test_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_validate_df.sample(frac=validate_ratio) # assign 10% to validate_df
train_df = train_validate_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
# fastai example usage
dls = fastai.tabular.all.TabularDataLoaders.from_df(
train_validate_df, valid_idx=train_validate_df.index.get_indexer_for(validate_df.index))
That's what I do:
train_dataset = dataset.sample(frac=0.80, random_state=200)
val_dataset = dataset.drop(train_dataset.index).sample(frac=1.00, random_state=200, ignore_index = True).copy()
train_dataset = train_dataset.sample(frac=1.00, random_state=200, ignore_index = True).copy()
del dataset

Categories

Resources