Nonsensical Confusion Matrix for ANN - python

I used the following methods to attempt to create an ANN model, but my confusion matrix (at the bottom) strongly indicates something has gone amiss. However, I am not sure where the problem started or why.
The dataset I used was split as such:
X = df.drop('Recurrence', axis = 1)
y = df['Recurrence']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
This was the train/test method:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
# Set seed for reproducibility
SEED = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
clf = MLPClassifier(hidden_layer_sizes=(100,), activation='logistic', solver='lbfgs',
                    learning_rate='adaptive', random_state=SEED, max_iter=200).fit(X_train, y_train)
from dmba import classificationSummary  # classificationSummary comes from the dmba package
classificationSummary(y_train, clf.predict(X_train))
classificationSummary(y_test, clf.predict(X_test))
This produced the following classification summaries:
Confusion Matrix (Accuracy 1.0000)

       Prediction
Actual  0
     0  1

Confusion Matrix (Accuracy 0.0000)

       Prediction
Actual  0  1
     0  0  1
     1  0  0

Related

How to input the model into the KNN classification algorithm?

I want to do image classification using KNN. I used https://pythonprogramming.net/loading-custom-data-deep-learning-python-tensorflow-keras/ to make a model. I have 20 images: 10 in the dog category and 10 in the cat category. I'm having trouble feeding the model into the KNN algorithm; there is a problem in my code. This is my code:
knn_model=KNeighborsClassifier(n_neighbors=3) #define K=3
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
predict_knn=knn_model.predict(X_test)
print(predict_knn)
There is an error: Found input variables with inconsistent numbers of samples: [60, 20]. I need your opinion on how to fix this code. Thank you.
The problem is most likely an inconsistent number of samples between X and y: train_test_split requires them to have the same length.
1. len(y) == 20
# Works
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(20*32*32*3).reshape((20, 32, 32, 3)), list(range(20))
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
2. len(y) == 60
# Does not work
X, y = np.arange(20*32*32*3).reshape((20, 32, 32, 3)), list(range(60))
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
The second snippet raises the same ValueError about inconsistent numbers of samples, because X has 20 samples while y has 60.
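Separately, note that the snippet in the question never calls fit before predict, which would raise a NotFittedError even with consistent sample sizes. A minimal sketch of the corrected flow, using dummy stand-ins for your 20 images and labels (KNeighborsClassifier expects 2D input, so each image is flattened into one row):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# dummy stand-ins for the real data: 20 images of 32x32x3 and 20 matching labels
X = np.random.rand(20, 32, 32, 3)
y = [0] * 10 + [1] * 10  # 10 dogs, 10 cats

# flatten each image into a single feature row
X = X.reshape(len(X), -1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn_model = KNeighborsClassifier(n_neighbors=3)  # define K=3
knn_model.fit(X_train, y_train)                  # fit before predicting
predict_knn = knn_model.predict(X_test)
print(predict_knn)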

How do I return the result of each cross validation prediction

I have a task that requires me to analyse a model, but I need the output predictions for each cross-validation step, and the data that the cross-validation used in that step.
Here is my code:
results= cross_validate(MLPClassifier, X_train, y_train, cv=5,return_estimator = True)
Which did not work. Also,
results= cross_val_predict(MLPClassifier, X_train, y_train, cv=5)
Neither worked. However, the second method did give me a set of predictions with the same shape as y_train (the labels), whereas I expected something smaller to be returned, say 10% the size of y_train.
I'm also unsure how to obtain the data used in each cross-validation step.
Both of your calls pass the MLPClassifier class rather than an instance; cross_validate and cross_val_predict expect an estimator object, e.g. cross_val_predict(MLPClassifier(), X_train, y_train, cv=5). Also, the output shape is expected: cross_val_predict returns one out-of-fold prediction for every sample, so its result is the same length as y_train. If you need the per-split data and predictions, how about using one of the cross-validation iterators?
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit
from sklearn.neural_network import MLPClassifier
X, y = make_classification(n_samples=1000, random_state=0)
datasets = {}  # holds (X_train, y_train) and (X_test, y_test) for each split
results = {}
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for idx, (train_index, test_index) in enumerate(ss.split(X)):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    datasets[f"train_{idx}"] = X_train, y_train
    datasets[f"test_{idx}"] = X_test, y_test
    model = MLPClassifier(random_state=0).fit(X_train, y_train)
    results[f"accuracy_{idx}"] = model.score(X_test, y_test)
results
Output:
{'accuracy_0': 0.968,
'accuracy_1': 0.924,
'accuracy_2': 0.94,
'accuracy_3': 0.944,
'accuracy_4': 0.964}
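If you also want the predictions from each split rather than just the accuracy, you can store them in the same loop; a minimal addition (the predictions dict is illustrative):
predictions = {}  # initialise before the loop, alongside datasets and results

# then add inside the loop, after fitting the model:
predictions[f"pred_{idx}"] = model.predict(X_test)  # out-of-fold predictions for this split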

How to fetch R2 for each target in sklearn MultiOutputRegressor() rather than the overall R2?

For example, Xs has 5 independent variables, and Ys has 5 dependent variables:
x_train, x_test, y_train, y_test = train_test_split(Xs, Ys, test_size=0.2, random_state=2)
model = lgb.LGBMRegressor()
wrapper = MultiOutputRegressor(model)
model.fit(x_train, y_train)
model.score(x_test, y_test)
I could only get the overall R2 with the code above. What if I want to check the R2 for each Y? Is it possible?
Thanks
(As an aside, your snippet fits model rather than wrapper, so the MultiOutputRegressor is never actually used; fit and score the wrapper instead.) To get per-target scores, you can use scikit-learn's r2_score with multioutput='raw_values':
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import r2_score
import lightgbm as lgb
# generate the data
X, Y = make_regression(n_targets=5, n_features=10, n_samples=1000, random_state=42)
# split the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# instantiate the model
model = MultiOutputRegressor(estimator=lgb.LGBMRegressor())
# fit the model
model.fit(X_train, Y_train)
# generate the model predictions
Y_pred = model.predict(X_test)
# calculate the individual R2's
print(r2_score(Y_test, Y_pred, multioutput='raw_values'))
# [0.907924 0.925267 0.906492 0.939653 0.881619]
print([r2_score(Y_test[:, i], Y_pred[:, i]) for i in range(Y_test.shape[1])])
# [0.907924, 0.925267, 0.906492, 0.939653, 0.881619]
# calculate the overall R2
print(model.score(X_test, Y_test))
# 0.9121908184618046
print(r2_score(Y_test, Y_pred, multioutput='uniform_average'))
# 0.9121908184618046

Build SVMs with different kernels (RBF)

I have this code in Python:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
# The gamma parameter is the kernel coefficient for kernels rbf/poly/sigmoid
svm = SVC(gamma='auto', probability=True)
svm.fit(X_train,y_train.values.ravel())
prediction = svm.predict(X_test)
prediction_prob = svm.predict_proba(X_test)
print('Accuracy:', accuracy_score(y_test,prediction))
print('AUC:',roc_auc_score(y_test,prediction_prob[:,1]))
print(X_train)
print(y_train)
Now I want to build this with a different kernel (rbf) and store the values into arrays, so something like this:
def svm_grid_search(parameters, cv):
    # Store the outcome of the folds in these lists
    means = []
    stds = []
    params = []
    for parameter in parameters:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
        # The gamma parameter is the kernel coefficient for kernels rbf/poly/sigmoid
        svm = SVC(gamma=1, kernel=parameter, probability=True)
        svm.fit(X_train, y_train.values.ravel())
        prediction = svm.predict(X_test)
        prediction_prob = svm.predict_proba(X_test)
    return means, stds, params
I know I want to loop over the parameters and then store the values in the lists, but I struggle with how to do so. What I am trying to do is loop, then store the results of each SVM in the arrays, with kernel = parameter. I would be very thankful if you could help me out here.
This is exactly what GridSearchCV is for: it loops over a parameter grid, cross-validates each candidate, and records the mean score, standard deviation, and parameters for you. See the scikit-learn documentation for sklearn.model_selection.GridSearchCV.
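A minimal sketch, assuming X and y are defined as in your first snippet (the grid values below are just illustrative):
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# candidate kernels and gamma values to search over (illustrative)
param_grid = {'kernel': ['rbf', 'poly', 'sigmoid'], 'gamma': [0.1, 1, 10]}

grid = GridSearchCV(SVC(probability=True), param_grid, cv=5)
grid.fit(X_train, y_train.values.ravel())

# per-candidate results, analogous to your means/stds/params lists
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
params = grid.cv_results_['params']
print(grid.best_params_)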

How to compute "y_train_true, y_train_prob, y_test_true, y_test_prob"?

I have computed X_train, X_test, y_train, and y_test, but I cannot compute y_train_true, y_train_prob, y_test_true, and y_test_prob.
How can I compute them from the following code?
N.B.:
y_train_true: true binary labels of 0 or 1 in the training dataset
y_train_prob: probabilities in the range [0, 1] predicted by the model for the training dataset
y_test_true: true binary labels of 0 or 1 in the testing dataset
y_test_prob: probabilities in the range [0, 1] predicted by the model for the testing dataset
Code:
# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array(dataset.iloc[:, 1:10])  # .ix is removed in modern pandas; use .iloc
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Define the classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
# knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)
# Predicting the test set results
y_pred = knn.predict(X_test)
Well, in your case y_train and y_test are already y_train_true and y_test_true. To get y_train_prob and y_test_prob you need a fitted model that can output probabilities. I don't know which dataset you're using, but it seems to be a binary classification problem, so any classifier with a predict_proba method works, for example logistic regression; sticking with your KNN:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, y_train)
y_train_prob = knn.predict_proba(X_train)
y_test_prob = knn.predict_proba(X_test)
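Note that predict_proba returns one column per class (shape (n_samples, 2) here), so if you need a single probability per sample, take the positive-class column:
# probability of the positive class (column index 1)
y_train_prob = knn.predict_proba(X_train)[:, 1]
y_test_prob = knn.predict_proba(X_test)[:, 1]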
