What do the results of a scikit-learn machine learning program represent?

I am working through Google's Machine Learning videos and completed a program that utilizes a database storing info about flowers. The program runs successfully, but I'm having trouble understanding the results:
from scipy.spatial import distance

def euc(a, b):
    return distance.euclidean(a, b)

class ScrappyKNN():
    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train

    def predict(self, x_test):
        predictions = []
        for row in x_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        best_dist = euc(row, self.x_train[0])
        best_index = 0
        for i in range(1, len(self.x_train)):
            dist = euc(row, self.x_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]

from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5)
print(x_train.shape, x_test.shape)

my_classifier = ScrappyKNN()
my_classifier.fit(x_train, y_train)
prediction = my_classifier.predict(x_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))
Results are as follows:
(75, 4) (75, 4)
0.96
The 96% is the accuracy, but what exactly do the 75 and 4 represent?

You are printing the shapes of the datasets on this line:
print(x_train.shape, x_test.shape)
Both x_train and x_test seem to have 75 rows (i.e. data points) and 4 columns (i.e. features) each. Unless you had an odd number of data points, these dimensions should be the same since you are performing a 50/50 training/testing data split on this line:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)

What it appears to me is that you are coding K Nearest Neighbours from scratch using the Euclidean metric.
In your code, x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5) splits the data 50/50 into a train set and a test set. sklearn's train_test_split splits the data by rows, so the number of features (columns) is the same in both sets. Hence (75, 4) is the number of rows followed by the number of features, for the train set and the test set respectively.
Now, the accuracy score of 0.96 means that, of the 75 rows in your test set, 96% were predicted correctly.
This compares the labels in your test set against the predicted labels (the y_pred calculated from prediction = my_classifier.predict(x_test)).
TP and TN are the numbers of correct predictions, while TP + TN + FP + FN sums to 75 (the total number of rows you are testing).
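If you want to see those counts explicitly, here is a minimal sketch using sklearn's confusion_matrix, assuming y_test and prediction from your code (note iris has three classes, so "correct" means the diagonal of the matrix rather than strictly binary TP/TN):
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, prediction)  # 3x3 matrix: rows = true class, columns = predicted class
correct = np.trace(cm)                     # diagonal entries are the correctly predicted rows
print(correct, "out of", cm.sum())         # e.g. 72 out of 75
print(correct / cm.sum())                  # same value as accuracy_score(y_test, prediction)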
Note: When performing train_test_split it is usually a good idea to split the data 80/20 instead of 50/50, to give the model more data to train on and therefore better predictions.
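As a quick illustration of that note, a sketch of an 80/20 split on the same 150-row iris data (reusing the names from the question):
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()  # 150 rows, 4 features
x, y = iris.data, iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)
print(x_train.shape, x_test.shape)  # (120, 4) (30, 4): 80% of the rows train, 20% test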

Related

How to apply KNN from a large dataset to a small dataset or to just one test sample

I have trained and tested a KNN model on a supervised dataset of about 180 samples (6 classes of 30 samples each) in Python. I would like to apply these results to a small unsupervised dataset of 21 samples (3 classes of 7 samples each).
The problem is that the datasets have different numbers of rows, so either I get an error about inconsistent numbers of samples, or I match the target in the new dataset and get an unrepresentative result.
I want to see which classes in the large dataset the data from the new small dataset correspond to. Is there a way to do that?
Here is my code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import utils

data, y = utils.load_data()  # utils contains the large dataset
Y = pd.get_dummies(y).values
n_classes = Y.shape[1]

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier()
for key in data:
    scores = cross_val_score(clf, data[key], y, cv=5)
    print("Accuracy for {:5s} : {:0.2f} (+/- {:0.2f})".format(
        key, scores.mean(), scores.std() * 2))

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('small dataset')
X = df.drop(columns=['subject', 'sessionIndex', 'rep'])
y = df['subject']
Y = pd.get_dummies(y).values

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1, stratify=y)

n_neighbors = [2, 3, 4, 5, 6]
parameters = dict(n_neighbors=n_neighbors)
clf = KNeighborsClassifier()
grid = GridSearchCV(clf, parameters, cv=5)
grid.fit(X_train, Y_train)

results = grid.cv_results_
for i in range(1, 4):
    candidates = np.flatnonzero(results['rank_test_score'] == i)
    for candidate in candidates:
        print("Model with rank: {}".format(i))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
            results['mean_test_score'][candidate],
            results['std_test_score'][candidate]))
        print("Parameters: {}".format(results['params'][candidate]))
        print()

from sklearn.metrics import accuracy_score, roc_curve, auc
Y_pred = grid.predict(X[1:2])
print(Y_pred)
So I'm getting an array [[0 0 1]], which is correct, but it doesn't check against any of the 6 classes in the large dataset. It's different if I take X and Y from the large dataset instead of the small one:
data, y = utils.load_data()  # utils contains the large dataset
Y = pd.get_dummies(y).values
n_classes = Y.shape[1]
X = data['large dataset']

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=1, stratify=y)
Y_pred = grid.predict(X[1:2])
print(Y_pred)
This way the result is an array of 6 numbers, like [[0 0 0 0 0 1]]. I want to see the same when testing the new small dataset.

Why does sklearn give me performance metrics so different from the predictions that it makes on all the other permutations of the same data?

As part of a classification model, I initially used seed 101 to generate my train and test data and calibrate my model. Then I printed the classification report and got a recall of 0.46 for the class=1 case.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
The next piece of code uses the calibration of the model in the previous step and tests its performance against different permutations of the test data. On each iteration I use a different random_state to create the test data.
results = []
for i in range(0, n):
    # Note: random_state differs on each iteration
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=i)
    predictions = model.predict(X_test)
    class_report = classification_report(y_test, predictions, output_dict=True)  # produce the classification report as a dict
    recall_precision = get_recall_precision(class_report)  # extract recall and precision for the dependent variable=1
    results.append(recall_precision)  # append recall and precision to a list on each iteration

# Split recall and precision of dependent variable=1 into two lists
recall = [x[0] for x in results]
precision = [x[1] for x in results]

# Plot recall against precision in a scatter plot
plt.scatter(precision, recall)
plt.xlabel('Precision', fontsize=16)
plt.ylabel("Recall", fontsize=16)
See below the scatter plot of the results for the precision and recall of the positive values (dependent variable=1). Note how my initial prediction is at the bottom left of the plot while all the rest seem to lie in a better-performing region. This does not seem right. How can this be happening?

Repeated holdout method

How can I make a "repeated" holdout method? I implemented the holdout method and got an accuracy, but I need to repeat the holdout 30 times.
Here is my code for the holdout method:
[IN]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y.values.ravel(), random_state=100)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.2f%%" % (result*100.0))
[OUT]
Accuracy: 49.62%
I see many code examples for repeated methods, but only for K-fold cross-validation, nothing for the holdout method.
To use a repeated holdout you could use the ShuffleSplit method from sklearn. A minimal working example (following the naming conventions that you used) might be as follows:
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Create some artificial data to train on; can be replaced by your own data
X, Y = make_classification()

rs = ShuffleSplit(n_splits=30, test_size=0.25, random_state=100)
model = LogisticRegression()
for train_index, test_index in rs.split(X):
    X_train, Y_train = X[train_index], Y[train_index]
    X_test, Y_test = X[test_index], Y[test_index]
    model.fit(X_train, Y_train)
    result = model.score(X_test, Y_test)
    print("Accuracy: %.2f%%" % (result * 100.0))
n_splits determines how many times you would like to repeat the holdout. test_size determines the fraction of samples that is sampled as the test set; in this case 75% is sampled as the train set, whereas 25% is sampled as your test set. For reproducible results you can set the random_state (any number suffices, as long as you use the same number consistently).
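Since the point of repeating the holdout is to smooth out the luck of any single split, you will usually want to aggregate the 30 scores instead of only printing them. A small sketch of that aggregation (reusing rs, model, X and Y from the example above):
import numpy as np

scores = []
for train_index, test_index in rs.split(X):
    model.fit(X[train_index], Y[train_index])
    scores.append(model.score(X[test_index], Y[test_index]))  # one holdout accuracy per split
print("Mean accuracy: %.2f%% (+/- %.2f%%)" % (np.mean(scores) * 100, np.std(scores) * 100))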

How to get a specific row for testing and the others for training?

I want to test a specific row from my dataset and see the result, but I don't know how to do it. For example, I want to test row number 100 and then see the accuracy.
feature_cols = [0, 1, 2, 3, 4, 5]
X = df[feature_cols]  # Features
y = df[6]             # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1,
                                                    random_state=1)

# Create Decision Tree classifier object
clf = DecisionTreeClassifier(max_depth=5)

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
I recommend excluding the row you want to test from the dataset.
import numpy as np

test_row = 100
train_idx = np.arange(X.shape[0]) != test_row
test_idx = np.arange(X.shape[0]) == test_row

X_train = X[train_idx]
y_train = y[train_idx]
X_test = X[test_idx]
y_test = y[test_idx]
Now X_test will contain a single row. However, the accuracy will now be either 0 or 1 since you are only testing one sample.
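To complete the picture, a sketch of training and predicting with that split, assuming X and y are the pandas objects from your snippet (so the boolean masks select rows) and using the same DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)     # train on every row except row 100
y_pred = clf.predict(X_test)  # X_test holds only row 100
print("Predicted:", y_pred[0], "Actual:", y_test.iloc[0])
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  # always 0.0 or 1.0 for a single row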

How to compute "y_train_true, y_train_prob, y_test_true, y_test_prob"?

I have computed X_train, X_test, y_train and y_test, but I cannot compute y_train_true, y_train_prob, y_test_true, y_test_prob.
How can I compute y_train_true, y_train_prob, y_test_true and y_test_prob from the following code?
X_train:
X_test:
y_train:
y_test:
N.B.:
y_train_true: true binary labels of 0 or 1 in the training dataset
y_train_prob: probability in the range [0, 1] predicted by the model for the training dataset
y_test_true: true binary labels of 0 or 1 in the testing dataset
y_test_prob: probability in the range [0, 1] predicted by the model for the testing dataset
Code:
# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(dataset.iloc[:, 1:10])
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Define the classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
# knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)

# Predicting the test set results
y_pred = knn.predict(X_test)
Well, in your case y_train and y_test are already y_train_true and y_test_true. To get y_train_prob and y_test_prob you need a fitted model. I don't know which dataset you're using, but it seems to be a binary classification problem, so you can use the predict_proba method of the KNN classifier you already have:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
y_train_prob = knn.predict_proba(X_train)
y_test_prob = knn.predict_proba(X_test)
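Note that predict_proba returns one column per class, i.e. shape (n_samples, 2) for a binary problem, so if you want a single probability in [0, 1] per sample you would typically keep just the positive-class column. A sketch, assuming the positive class is encoded as 1:
y_train_true = y_train                           # the true labels you already have
y_test_true = y_test
y_train_prob = knn.predict_proba(X_train)[:, 1]  # P(class == 1) for each training row
y_test_prob = knn.predict_proba(X_test)[:, 1]    # P(class == 1) for each test row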
