Seperate Dataframe by label (convert dataframe into numpy array)

Seperate Dataframe by label (convert dataframe into numpy array) - python

I have a dataframe and I want to separate the them into different arrays according to their label, I'm not sure on how to filter it by its index. Not sure on how if this is properly being done:
Example of Dataset (df)
Cancer_Type | Variable | Data Split | Target
Cancer1 43 Train Good
Cancer5 34 Train Bad
Cancer2 34 Test Good
Cancer3 23 Test Bad
Cancer4 25 Test Good
Possibly doing something like this?
#initial split into train/test data
train = df['split'] == 'train'
print("train")
print(train)
test = df['split'] == 'test'
print("valid")
print(test)
X_test = test.values[-1, :-1]
y_test = test.values[-1, -1]
# Get the remaining dataset
X = train.values[:-1, :-1]
y = train.values[:-1, -1]
print("X")
#print(type(X))
#print(X)
print("y")
#print(type(y))
#print(y)
# Split the remaining dataset into train and calibration sets.
X_train, X_cal, y_train, y_cal = train_test_split(X, y)
print(X_train.shape, y_train.shape)
print(X_cal.shape, y_cal.shape)
Hopefully by row.

From my understanding, you wish to split the data into train and test sets according to an observation's Data Split value. After which, you will again split the train set into train and calibration. The standard data preprocessing methodology involves creating our features, X and our target, y.
# Get dataframes of train and test features
X_train = df[df['Data Split'] == 'Train'].drop(columns = ['Target']).to_numpy()
X_test = df[df['Data Split'] == 'Test'].drop(columns = ['Target']).to_numpy()
# Get arrays of train and test targets
y_train = df[(df['Data Split'] == 'Train')]["Target"].to_numpy()
y_test = df[(df['Data Split'] == 'Test')]["Target"].to_numpy()
# Split the train dataset further into train and validation/calibration sets.
X_train, X_cal, y_train, y_cal = train_test_split(X_train, y_train)
You now have your train, validation/calibration and test sets in array form.
If you wish to preserve the Target variable, simply
train = df[df['Data Split'] == 'Train'].to_numpy()
test = df[df['Data Split'] == 'Test'].to_numpy()

Related

How to get a specific row for testing and other for training?

I want to test a specific row from my dataset and to see the result, but I don't know how to do it. For example I want to test row number 100 and then to see the accuracy.
feature_cols = [0,1,2,3,4,5]
X = df[feature_cols] # Features
y = df[6] # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1,
random_state=1)
#Create Decision Tree classifer object
clf = DecisionTreeClassifier(max_depth=5)
#Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

I recommend excluding the row you want to test from the dataset.
test_row=100
train_idx=np.arange(X.shape[0])!=test_row
test_idx=np.arange(X.shape[0])==test_row
X_train=X[train_idx]
y_train=y[train_idx]
X_test=X[test_idx]
y_test=y[test_idx]
Now X_test will contain a single row. However, the accuracy will now be either 0 or 1 since you are only testing one sample.

How to pass different set of data to train and test without splitting a dataframe. (python)?

I have gone through multiple questions that help divide your dataframe into train and test, with scikit, without etc.
But my question is I have 2 different csvs ( 2 different dataframes from different years). I want to use one as train and other as test?
How to do so for LinearRegression / any model?

Load the datasets individually.
Make sure they are in the same format of rows and columns (features).
Use the train set to fit the model.
Use the test set to predict the output after training.
# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Split features and value
# when trying to predict column "target"
X_train, y_train = train.drop("target"), train["target"]
X_test, y_test = test.drop("target"), test["target"]
# Fit (train) model
reg = LinearRegression()
reg.fit(X_train, y_train)
# Predict
pred = reg.predict(X_test)
# Score
accuracy = reg.socre(X_test, y_test)

please skillsmuggler what about the X_train and X_Test how I can define it because when I try to do that it said NameError: name 'X_train' is not defined

I couldn't edit the first answer which is almost there. There is some code missing though...
# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
y_train = train[:, :1] #if y is only one column
X_train = train[:, 1:]
# Fit (train) model
reg = LinearRegression()
reg.fit(X_train, y_train)
# Predict
pred = reg.predict(X_test)
# Score
accuracy = reg.socre(X_test, y_test)

Splitting dataset for training and testing row wise

I want to split my dataset into training and test datasets based on years. The idea is to put the rows with years ranging form 2009-2017 in train dataset and the 2018 data in test dataset. Splitting the datasets was easy for the most part but my models are throwing a lot of indexing issues.
X = ((df[df['Year'] < 2018]))
X_train = np.array(X.drop(['Usage'], 1))
X_test = np.array(X['Usage'])
y =((df[df['Year'] > 2017]))
y_train = np.array(y.drop(['Usage'], 1))
y_test = np.array(y['Usage'])
This is how I plan on splitting the data. The usage column is my forecast column and contains continuous values. Applying a simple RandomForestRegressor() gave me this error in return
ValueError: Number of labels=14495 does not match number of samples=382772
aditya my regressor model was pretty basic but i'm attaching the code any way. the columns being passed in X are as follows: X= [Cust_Id', 'Usage', 'Plan_Group', 'Contract_Type', 'Cust_Status','Premise_Zip', 'Year', 'Month']
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# evaluate predictions
print(model.score(X_test, y_test))
# accuracy = accuracy_score(y_test, (y_pred < 0.5).astype(int))

For most of the algorithms in sklearn stack, you have a standard notation:
X, capital letter, is usually an array (even if there is one feature) and represents each data point in vector form.
y, small letter, is usually a vector that denotes labels, e.g. class label, or value of a regression element.
You created X and y both as a dataframe generated by the Year attribute. Instead you have to split into X_train and X_test.
X = df.drop(['Usage'],1)
X_train = df[df['Year'] < 2018]
X_test = df[df['Year'] > 2017]
y_train = df[df['Year'] < 2018]
y_test = df[df['Year'] > 2017]
y_train = y_train['Usage']
y_test = y_test['Usage']
And then you train on the basis of X_train and y_train
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
This is not the best way though. Will come back to edit the answer but this should be enough to get you going for now.

Randomize the splitting of data for training and testing for this function

I wrote a function to split numpy ndarrays x_data and y_data into training and test data based on a percentage of the total size.
Here is the function:
def split_data_into_training_testing(x_data, y_data, percentage_split):
number_of_samples = x_data.shape[0]
p = int(number_of_samples * percentage_split)
x_train = x_data[0:p]
y_train = y_data[0:p]
x_test = x_data[p:]
y_test = y_data[p:]
return x_train, y_train, x_test, y_test
In this function, the top part of the data goes to the training dataset and the bottom part of the data samples go to the testing dataset based on percentage_split. How can this data split be made more randomized before being fed to the machine learning model?

Assuming there's a reason you're implementing this yourself instead of using sklearn.train_test_split, you can shuffle an array of indices (this leaves the training data untouched) and index on that.
def split_data_into_training_testing(x_data, y_data, split, shuffle=True):
idx = np.arange(len(x_data))
if shuffle:
np.random.shuffle(idx)
p = int(len(x_data) * split)
x_train = x_data[idx[:p]]
x_test = x_data[idx[p:]]
... # Similarly for y_train and y_test.
return x_train, x_test, y_train, y_test

You can create a mask with p randomly selected true elements and index the arrays that way. I would create the mask by shuffling an array of the available indices:
ind = np.arange(number_of_samples)
np.random.shuffle(ind)
ind_train = np.sort(ind[:p])
ind_test = np.sort(ind[p:])
x_train = x_data[ind_train]
y_train = y_data[ind_train]
x_test = x_data[ind_test]
y_test = y_data[ind_test]
Sorting the indices is only necessary if your original data is monotonically increasing or decreasing in x and you'd like to keep it that way. Otherwise, ind_train = ind[:p] is just fine.

Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
But that raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.

your y_hats length will only be the length on the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by examining the accuracy of your model on the X_test predictions compared to the X_test true values), you should rerun the predict on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
EDIT per your comment, here is an updated result the returns the dataset with the prediction appended where they were in the test datset
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)

I have the same problem (almost)
I fixed it this way
...
.
.
.
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hats = model.predict(X_test)
y_hats = pd.DataFrame(y_hats)
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]
y_test['preds'] = y_hats
df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)

You can create a y_hat dataframe copying indices from X_test then merge with the original data.
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)
Note, left join will include train data rows. Omitting 'how' parameter will result in just test data.

Try this:
y_hats2 = model.predict(X)
df[['y_hats']] = y_hats2

You can probably make a new dataframe and add to it the test data along with the predicted values:
data['y_hats'] = y_hats
data.to_csv('data1.csv')

predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'],
index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True,
right_index=True)

This worked well for me. It maintains the indexing positions.
pred_prob = model.predict(X_test) # calculate prediction probabilities
pred_class = np.where(pred_prob >0.5, "Yes", "No") #for binary(Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis =1)

Here is a solution that worked for me:
It consists of building, for each of your folds/iterations, one dataframe which includes observed and predicted values for your test set; this way, you make use of the index (ID) contained in y_true, which should correspond to your subjects' IDs (in my code: 'SubjID').
You then concatenate the DataFrames that you generated (through 5 folds of test data in my case) and paste them back into your original dataset.
I hope this helps!
FoldNr = 0
for train_index, test_index in skf.split(X, y):
FoldNr = FoldNr + 1
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
# [...] your model
# performance is measured on test set
y_true, y_pred = y_test, clf.predict(X_test)
# Save predicted values for each test set
a = pd.DataFrame(y_true).reset_index()
b = pd.Series(y_pred, name = 'y_pred')
globals()['ObsPred_df' + str(FoldNr)] = a.join(b)
globals()['ObsPred_df' + str(FoldNr)].set_index('SubjID', inplace=True)
# Create dataframe with observed and predicted values for all subjects
ObsPred_Concat = pd.concat([ObsPred_df1, ObsPred_df2, ObsPred_df3, ObsPred_df4, ObsPred_df5])
original_df['y_pred'] = ObsPred_Concat['y_pred']

First you need to convert y_val or y_test data into the DataFrame.
compare_df = pd.DataFrame(y_val)
then just create a new column with predicted data.
compare_df['predicted_res'] = y_pred_val
After that, you can easily filter the data that shows you which data is matching with original prediction based on a simple condition.
test_df = compare_df[compare_df['y_val'] == compare_df['predicted_res'] ]

you can also use
y_hats = model.predict(X)
df['y_hats'] = y_hats.reset_index()['name of the target column']

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Seperate Dataframe by label (convert dataframe into numpy array) - python

Related

How to get a specific row for testing and other for training?

How to pass different set of data to train and test without splitting a dataframe. (python)?

Splitting dataset for training and testing row wise

Randomize the splitting of data for training and testing for this function

Merging results from model.predict() with original pandas DataFrame?

Categories

Resources