Relate the predicted value to it index/identification number - python

I am training a model to predict true or false based on some data. I drop the product number from the list of features when training and testing the model.
X = df.drop(columns = 'Product Number', axis = 1)
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
SVC = LinearSVC(max_iter = 1200)
SVC.fit(X_train, y_train)
y_pred = SVC.predict(X_test)
Is there any way for me to recover the product number and its features for the item that has passed or failed? How do I get/relate the results of y_pred to which product number it corresponds to?
I also plan on using cross validation so the data gets shuffled, would there still be a way for me to recover the product number for each test item?

I realised I'm using cross validation only to evaluate my model's performance so I decided to just run my code without shuffling the data to see the results for each datapoint.
Edit: For evaluation without cross validation, I drop the irrelevant columns only when I pass it to the classifier as shown below:
cols = ['id', 'label']
X = train_data.copy()
y = train_data['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
knn = make_pipeline(StandardScaler(),KNeighborsClassifier(n_neighbors=10))
y_val_pred = knn.fit(X_train.drop(columns=cols), y_train).predict(X_val.drop(columns=cols))
X_val['y_val_pred'] = y_val_pred
I join the y_val_pred after prediction to check which datapoints have been misclassified.

Related

Best practice for train, validation and test set

I want to assign a sample class to each instance in a dataframe - 'train', 'validation' and 'test'. If I use sklearn train_test_split(), twice, I can get the indices for a train, validation and test set like this:
X = df.drop(['target'], axis=1)
y=df[['target']]
X_train, X_test, y_train, y_test, indices_train, indices_test=train_test_split(X, y, df.index,
test_size=0.2,
random_state=10,
stratify=y,
shuffle=True)
df_=df.iloc[indices_train]
X_ = df_.drop(['target'], axis=1)
y_=df_[['target']]
X_train, X_val, y_train, y_val, indices_train, indices_val=train_test_split(X_, y_, df_.index,
test_size=0.15,
random_state=10,
stratify=y_,
shuffle=True)
df['sample']=['train' if i in indices_train else 'test' if i in indices_test else 'val' for i in df.index]
What is best practice to get a train, validation and test set? Is there any problems with my approach above and can it be frased better?
a faster and optimal solution if dataset is large would be using numpy.
How to split data into 3 sets (train, validation and test)?
or the simpler way is your solution, but maybe just feed the x_train, y_train you obtained in the 1 step, for the train validation split? like the indices being stored and rows just removed from the df feels unnecessary.
So, I did a dummy dataset of 100 points.
I separate the data and I did the first split:
X = df.drop('target', axis=1)
y = df['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
If you have a look, my test size is 0.3 which means 70 data points will go for traininf and 30 for test and validation as well.
X_train.shape # Output (70, 3)
X_test.shape # Output (30, 3)
Now you need to split again for validation, so you can do it like this:
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5)
Notice how I name the groups and the test_size is now 0.5. Which means I take the 30 points for test and I splitted for validation as well. So the shape of validation and testing, will be:
X_val.shape # Output (15, 3)
X_test.shape # Output (15, 3)
At the end you have 70 points for training, 15 for testing and 15 for validation.
Now, consider validation as "double check" of your training. There are a lot of messy concepts related with that. It's just be sure of your training.

How to use cross validation with sm.GLM (sm = statsmodels.api)?

How to use cross validation with sm.GLM (sm = statsmodels.api)?
I am trying to fill the "model" parameter from the cross_val_score. However, since I need to use the sm.GLM I don't know how to use it.
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state = 1)
n = 5
df_cut, bins = pd.cut(X_train, n, retbins=True, right=True)
df_steps = pd.concat([X_train, df_cut, y_train], keys=['age','age_cuts','wage'], axis=1)
# Create dummy variables for the age groups
df_steps_dummies = pd.get_dummies(df_cut)
# Fitting Generalised linear models
fit3 = sm.GLM(df_steps.wage, df_steps_dummies).fit()
# Binning validation set into same 5 bins
bin_mapping = np.digitize(X_test, bins, right=True)
X_valid = pd.get_dummies(bin_mapping)
# Removing any outliers
# X_valid = pd.get_dummies(bin_mapping).drop([6], axis=1)# Prediction
pred2 = fit3.predict(X_valid)
# Calculating RMSE
rms = sqrt(mean_squared_error(y_test, pred2))
print(rms)
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error',
cv=cv, n_jobs=-1)
Therefore, instead of computing the RMSE like above, I would like to use the cross_val_score. For example, in the model parameter, if I would like to use lasso I would put model = lasso(). However, here I can not put model = sm.GLM().
I hope it is clear...

I want to compare my prediction value with original train data

I am trying to learn decision tree regressor and I have wrote below code.
X_train, X_test, y_train, y_test = train_test_split(
x, y, test_size = 0.3, random_state = 100)
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
I want to create a dataframe which include X_test and Y_test and Y_pred.
Is there any method or function for that.
Append the below code at the end of your prediction code:
final_df = X_test.copy()
final_df["Y_original"] = y_test
final_df["Y_predicted"] = y_pred
Here we are creating a new dataframe namely final_df and putting all the values you require into it. Would not suggest you to directly append values into X_test, as it might be needed for use again for prediction.

What do the results on a Sci-Kit machine learning program represent?

I am working through Google's Machine Learning videos and completed a program that utilizes a database sotring info about flowers. The program runs successfully, but I'm having toruble understanding the results:
from scipy.spatial import distance
def euc(a,b):
return distance.euclidean(a, b)
class ScrappyKNN():
def fit(self, x_train, y_train):
self.x_train = x_train
self.y_train = y_train
def predict(self, x_test):
predictions = []
for row in x_test:
label = self.closest(row)
predictions.append(label)
return predictions
def closest(self, row):
best_dist = euc(row, self.x_train[0])
best_index = 0
for i in range(1, len(self.x_train)):
dist = euc(row, self.x_train[i])
if dist < best_dist:
best_dist = dist
best_index = i
return self.y_train[best_index]
from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
print(x_train.shape, x_test.shape)
my_classifier = ScrappyKNN()
my_classifier .fit(x_train, y_train)
prediction = my_classifier.predict(x_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))
Results are as follows:
(75, 4) (75, 4)
0.96
The 96% is the accuracy, but what exactly do the 75 and 4 represent?
You are printing the shapes of the datasets on this line:
print(x_train.shape, x_test.shape)
Both x_train and x_test seem to have 75 rows (i.e. data points) and 4 columns (i.e. features) each. Unless you had an odd number of data points, these dimensions should be the same since you are performing a 50/50 training/testing data split on this line:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
What it appears to me is that you are coding out the K Nearest Neighour from scratch using the euclidean metrics.
From your code x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5), what you are doing is to split the train and test data into 50% each. sklearn train-test-split actually splits the data by the rows, hence the features(number of columns) have to be the same. Hence (75,4) are your number of rows, followed by the number of features in the train set and test set respectively.
Now, the accuracy score of 0.96 basically means that, of your 75 rows in your test set, 96% are predicted correctly.
This compares the results from your test set and predicted set (the y_pred calculated from prediction = my_classifier.predict(x_test).)
TP, TN are the number of correct predictions while TP + TN + FP + FN basically sums up to 75 (total number of rows you are testing).
Note: When performing train-test-split its usually a good idea to split the data into 80/20 instead of 50/50, to give a better prediction.

How to get a specific row for testing and other for training?

I want to test a specific row from my dataset and to see the result, but I don't know how to do it. For example I want to test row number 100 and then to see the accuracy.
feature_cols = [0,1,2,3,4,5]
X = df[feature_cols] # Features
y = df[6] # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1,
random_state=1)
#Create Decision Tree classifer object
clf = DecisionTreeClassifier(max_depth=5)
#Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
I recommend excluding the row you want to test from the dataset.
test_row=100
train_idx=np.arange(X.shape[0])!=test_row
test_idx=np.arange(X.shape[0])==test_row
X_train=X[train_idx]
y_train=y[train_idx]
X_test=X[test_idx]
y_test=y[test_idx]
Now X_test will contain a single row. However, the accuracy will now be either 0 or 1 since you are only testing one sample.

Categories

Resources