I am trying to plot a decision boundary using some variables. I know that I need to split the data into training and test sets and then plot the decision boundary, which should be possible with the X_train, X_test line; however, I seem to encounter a problem. Below I show my code and the error as well.
def LogReg2D0732():
    GDP = np.array(GDPArr) # Temp
    Support = np.array(SupportArr) # Income
    Score = np.array(list(map(ScoreConvert, ScoreArr))) # Sales
    X = GDP
    X = np.vstack((X, Support)) # var x samples
    y = Score

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # use sci-kit learn
    clf = LogisticRegression(fit_intercept=True, C=1e15) # instantiate
    clf.fit(X.T, y) # do the fit
    print("Coef ", clf.intercept_, clf.coef_) # print some attributes

    xx, yy = np.mgrid[np.min(GDP):np.max(GDP):0.01, np.min(Support):np.max(Support):0.01]
    gridxy = np.c_[xx.ravel(), yy.ravel()]
    probs = clf.predict_proba(gridxy)[:, 1].reshape(xx.shape)

    f, ax = plt.subplots(figsize=(12, 8))
    contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu", vmin=0, vmax=1)
    ax_c = f.colorbar(contour) # for the probability range
    ax_c.set_ticks([0, 1/4, 1/2, 3/4, 1]) # and values

    idx = np.where(y == 1); idx = np.reshape(idx, np.shape(idx)[1]) # tuple>list
    y1 = X[:, idx]
    idx = np.where(y == 0); idx = np.reshape(idx, np.shape(idx)[1]) # tuple>list
    y0 = X[:, idx]
    ax.scatter(y1[0, :], y1[1, :], c='cyan') # light blue
    ax.scatter(y0[0, :], y0[1, :], c='pink')
    plt.title('GDP against Social support')
    plt.xlabel('GDP')
    plt.ylabel('Social Support')
    plt.savefig('LR2D.svg')
    plt.show()

    # compute AUC
    probsa = clf.predict_proba(X.T)
    probsa = probsa[:, 1] # only the positive samples
    auc = roc_auc_score(y, probsa)
    print("Area under curve:", auc)
    #fpr, tpr, thresholds = roc_curve(y, probsa)
    fpr, tpr, thresholds = roc_curve(y, probsa, drop_intermediate=False)
    plt.plot(fpr, tpr, color='sienna', label='ROC')
    plt.plot([0, 1], [0, 1], color='seagreen', linestyle='dotted') # no skill
    plt.xlabel('False Positive Rate, Specificity')
    plt.ylabel('True Positive Rate, Sensitivity')
    plt.title('Receiver Operating Characteristic (ROC) Plot')
    plt.legend()
    plt.savefig('LRauc.svg')
    plt.show()

LogReg2D0732() # Logistic Regression
ValueError Traceback (most recent call last)
<ipython-input-21-31361159fc28> in <module>
48 plt.savefig('LRauc.svg')
49 plt.show()
---> 50 LogReg2D0732()#Logistic Regression
<ipython-input-21-31361159fc28> in LogReg2D0732()
8 y = Score
9
---> 10 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
11
12 # use sci-kit learn
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
2116 raise TypeError("Invalid parameters passed: %s" % str(options))
2117
-> 2118 arrays = indexable(*arrays)
2119
2120 n_samples = _num_samples(arrays[0])
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
246 """
247 result = [_make_indexable(X) for X in iterables]
--> 248 check_consistent_length(*result)
249 return result
250
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
210 if len(uniques) > 1:
211 raise ValueError("Found input variables with inconsistent numbers of"
--> 212 " samples: %r" % [int(l) for l in lengths])
213
214
ValueError: Found input variables with inconsistent numbers of samples: [2, 156]
I also do not quite understand the error that reports an inconsistent number of samples: [2, 156]. Please let me know how to solve this issue.
The error happens when the split is created:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The error means the arrays do not have the same number of samples along their first axis: X was stacked as variables x samples, so it has shape (2, 156), while y has 156 entries, so train_test_split sees 2 samples in one array and 156 in the other. Check this link for similar issues.
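As a minimal sketch of one way to resolve it (my own suggestion, following the clf.fit(X.T, y) call already in the function): transpose X so that rows are samples before splitting.

# Sketch: X was built as variables x samples, shape (2, 156); samples must be rows.
X = np.vstack((GDP, Support)).T   # shape (156, 2): samples x variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LogisticRegression(fit_intercept=True, C=1e15)
clf.fit(X_train, y_train)         # no .T needed once rows are samples
print("Held-out accuracy:", clf.score(X_test, y_test))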
I wrote a function to find the best combination of the given dataframe's features, together with the F1 score and AUC score, using LogisticRegression. The problem is that when I try to pass a list of dataframe column combinations, built with itertools.combinations, LogisticRegression doesn't recognize each combination as its own X variable/dataframe.
I'm starting with a dataframe of 10 feature columns and 10k rows. When I run the code below I get "ValueError: X has 10 features, but LogisticRegression is expecting 1 features as input".
def find_best_combination(X, y):
    # initialize variables
    best_f1 = 0
    best_auc = 0
    best_variables = []
    # get all possible combinations of variables
    for i in range(1, X.shape[1]):
        for combination in combinations(X.columns, i):
            X_subset = X[list(combination)]
            logreg = LogisticRegression()
            logreg.fit(X_subset, y)
            y_pred = logreg.predict(X_subset)
            f1 = f1_score(y, y_pred)
            auc = roc_auc_score(y, logreg.predict_proba(X)[:,1])
            # evaluate performance on current combination of variables
            if f1 > best_f1 and auc > best_auc:
                best_f1 = f1
                best_auc = auc
                best_variables = combination
    return best_variables, best_f1, best_auc
and the error
C:\Users\katurner\Anaconda3\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names unseen at fit time:
- IBE1273_01_11.0
- IBE1273_01_6.0
- IBE7808
- IBE8439_2.0
- IBE8557_7.0
- ...
warnings.warn(message, FutureWarning)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\2\ipykernel_15932\895415673.py in <module>
----> 1 best_combo = ml.find_best_combination(X,lg_y)
2 best_combo
~\Documents\Arcadia\modeling_library.py in find_best_combination(X, y)
176 # print(y_test)
177 f1 = f1_score(y, y_pred)
--> 178 auc = roc_auc_score(y, logreg.predict_proba(X)[:,1])
179 # evaluate performance on current combination of variables
180 if f1> best_f1 and auc > best_auc:
~\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py in predict_proba(self, X)
1309 )
1310 if ovr:
-> 1311 return super()._predict_proba_lr(X)
1312 else:
1313 decision = self.decision_function(X)
~\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in _predict_proba_lr(self, X)
459 multiclass is handled by normalizing that over all classes.
460 """
--> 461 prob = self.decision_function(X)
462 expit(prob, out=prob)
463 if prob.ndim == 1:
~\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
427 check_is_fitted(self)
428
--> 429 X = self._validate_data(X, accept_sparse="csr", reset=False)
430 scores = safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_
431 return scores.ravel() if scores.shape[1] == 1 else scores
~\Anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
598
599 if not no_val_X and check_params.get("ensure_2d", True):
--> 600 self._check_n_features(X, reset=reset)
601
602 return out
~\Anaconda3\lib\site-packages\sklearn\base.py in _check_n_features(self, X, reset)
398
399 if n_features != self.n_features_in_:
--> 400 raise ValueError(
401 f"X has {n_features} features, but {self.__class__.__name__} "
402 f"is expecting {self.n_features_in_} features as input."
ValueError: X has 10 features, but LogisticRegression is expecting 1 features as input.
I'm expecting the function to return a combination of best_variables and the associated best_f1 and best_auc.
I've also tried running the function using train_test_split. When I add train_test_split into the code below, the function does run but returns "[], 0, 0" for best_variables, best_f1, and best_auc.
def find_best_combination(X, y):
    # initialize variables
    best_f1 = 0
    best_auc = 0
    best_variables = []
    # get all possible combinations of variables
    for i in range(1, X.shape[1]):
        for combination in combinations(X.columns, i):
            X_subset = X[list(combination)]
            X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.2, stratify=y, random_state=73)
            logreg = LogisticRegression()
            logreg.fit(X_train, y_train)
            y_pred = logreg.predict(X_test)
            f1 = f1_score(y_test, y_pred)
            auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])
            # evaluate performance on current combination of variables
            if f1 > best_f1 and auc > best_auc:
                best_f1 = f1
                best_auc = auc
                best_variables = combination
    return best_variables, best_f1, best_auc
I'm not sure what's going on under the hood of train_test_split that enables the function to iterate through without erroring like before.
I hope this explains it enough. Thanks in advance for any help.
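For reference, a minimal sketch of the likely mismatch (my reading of the traceback, not part of the original post): the model is fit on X_subset, which has i columns, but predict_proba is then called with the full 10-column X on the line the traceback points at, which matches both the FutureWarning about unseen feature names and the "expecting 1 features" error. Scoring on the same subset keeps the feature counts consistent:

# Sketch only: evaluate on the same columns the model was fit on.
logreg.fit(X_subset, y)
y_pred = logreg.predict(X_subset)
f1 = f1_score(y, y_pred)
auc = roc_auc_score(y, logreg.predict_proba(X_subset)[:, 1])  # X_subset here, not the full X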
I have a problem: I want to plot my RMSE values. However, I now use a pipeline because I use cross-validation and other steps such as feature selection.
My question is: is there a way to get this plot through the pipeline (without training the model a second time)? In other words, how can I display the training and validation RMSE values nicely in a diagram when using the pipeline?
Pipeline
dfListingsFeature_regression = pd.read_csv(r"https://raw.githubusercontent.com/Coderanker3/dataset4/main/listings_cleaned.csv")
d = {True: 1, False: 0, np.nan : np.nan}
dfListingsFeature_regression['host_is_superhost'] = dfListingsFeature_regression[
    'host_is_superhost'].map(d).astype('int')
X = dfListingsFeature_regression.drop(columns=['host_id', 'id', 'price']) # Features
y = dfListingsFeature_regression['price'] # Target variable
print(dfListingsFeature_nor.shape)
steps = [('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=10000))),
         ('lasso', Lasso(alpha=0.4))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=30)
parameteres = { }
grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)
grid.fit(X_train, y_train)
print("score = %3.2f" %(grid.score(X_test,y_test)))
print('Training set score: ' + str(grid.score(X_train,y_train)))
print('Test set score: ' + str(grid.score(X_test,y_test)))
y_pred = grid.predict(X_test)
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))
y_train_predict = grid.predict(X_train)
print("Train:" , metrics.mean_squared_error(y_train, y_train_predict , squared=False))
r2 = metrics.r2_score(y_test, y_pred)
print(r2)
Plot
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, 500 + 1):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))

    plt.figure(figsize=(10, 10))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
%%time
lin_reg = Lasso(alpha=0.1)
plot_learning_curves(lin_reg, X, y)
#plt.axis([0, 80, 0, 3])
plt.show()
You don't have to fit() your model again in plot_learning_curves. You can simply use your fitted pipeline to predict values for both the train and validation sets and then plot your learning curve.
Your function should look as follows, without the model.fit():
def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, 500 + 1):
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))

    plt.figure(figsize=(10, 10))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)
Then call this function with your already fitted model (or fitted pipeline) as the parameter.
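For example, the call could reuse the fitted GridSearchCV object from the question (a small usage sketch on my part, assuming grid has been fit as above):

# grid was already fit above; its best pipeline is reused here without refitting.
plot_learning_curves(grid.best_estimator_, X, y)
plt.show()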
I am trying to apply a random forest and I am getting this error: "ValueError: array length 293 does not match index length 975". Please find the code snippet below.
Can anyone please tell me what I am doing wrong?
Code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_tmt, test_size=0.30)
def model(estimator, idx, tr_x, tr_y, ts_x, ts_y, feature_imp=True, save_importance=True, path=None):
    estimator.fit(tr_x, tr_y)

    if feature_imp:
        try:
            importance = estimator.feature_importances_
            feat_imp = pd.DataFrame({'importance': clf.feature_importances_})
            feat_imp['feature'] = X_train.columns
            feat_imp.sort_values(by='importance', ascending=False, inplace=True)
            # feat_imp = feat_imp.iloc[:top_n]
            plt.barh(feat_imp['feature'][:20], feat_imp['importance'][:20])
            plt.show()
            if save_importance:
                feat_imp.to_csv(path)
        except:
            print("Use a Tree based classifier")

    preds = estimator.predict(ts_x)
    preds_proba = estimator.predict_proba(ts_x)
    print("Accuracy is: %0.2f" % accuracy_score(ts_y, preds))
    print("Area under the curve: %0.2f" % roc_auc_score(ts_y, preds_proba[:,1]))

    fpr, tpr, threshold = roc_curve(ts_y, preds_proba[:,1])
    roc_auc = auc(fpr, tpr)
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

    op = pd.DataFrame({'id': idx,
                       'truth': ts_y,
                       'probability': preds_proba[:,1],
                       'predicted': preds})
    return op

clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, max_depth=8,
                             min_samples_leaf=256, random_state=1, warm_start=True,
                             criterion='entropy', max_features='sqrt')

output = model(clf, X['RecordID'], X_train, y_train,
               X_test, y_test,
               feature_imp=True, save_importance=False, path=None)
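For what it's worth, the two numbers in the error line up with this snippet: X['RecordID'] carries all 975 rows, while ts_y and the predictions come from the 30% test split (293 rows). A minimal sketch of one way to keep them aligned (my assumption, not from the original post) is to pass the ids of the test rows rather than the ids of the whole frame:

# Sketch: take the ids from the same split that is being scored,
# assuming 'RecordID' is still a column of X and therefore of X_test.
output = model(clf, X_test['RecordID'], X_train, y_train,
               X_test, y_test,
               feature_imp=True, save_importance=False, path=None)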
I have a dataset with X.shape (104481, 34) and y.shape (104481,), and I want to train an SVM model on it.
The steps I do are (1) Split data, (2) Scale data, and (3) Train SVM:
(1) Split data:
Function:
from sklearn.model_selection import train_test_split
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=12, stratify=y)
    return X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = split_data_set.split_data(X,y)
The 4 classes are the following. The data set is quite imbalanced, but that is an issue for later.
y_train.value_counts()
out:
Status_9_Substatus_8 33500
Other 33500
Status_62_Substatus_7 2746
Status_62_Substatus_30 256
Name: Status, dtype: int64
y_test.value_counts()
out:
Status_9_Substatus_8 16500
Other 16500
Status_62_Substatus_7 1352
Status_62_Substatus_30 127
Name: Status, dtype: int64
(2) Scale data:
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape)
print(y_train.shape)
(3) Train and predict with SVM:
svm_method.get_svm_model(X_train_scaled, X_test_scaled, y_train, y_test)
Calling this method:
def get_svm_model(X_train, X_test, y_train, y_test):
    print('Loading...')
    print('Training...')
    svm, y_train_pred, y_test_pred = train_svm_model(X_train, y_train, X_test)
    print('Training Complete')
    print('Plotting Confusion Matrix...')
    performance_measure.plot_confusion_matrix(y_test, y_test_pred, normalize=True)
    print('Plotting Performance Measure...')
    performance_measure.get_performance_measures(y_test, y_test_pred)
    return svm
Which calls this method:
def train_svm_model(X_train, y_train, X_test):
    svm = SVC(kernel='poly', gamma='auto', random_state=12)
    # Fitting the model
    svm.fit(X_train, y_train)
    # Predicting values
    y_train_pred = svm.predict(X_train)
    y_test_pred = svm.predict(X_test)
    return svm, y_train_pred, y_test_pred
The resulting output is shown in the screenshot.
What is strange is that there are samples from all four classes present (since I used the stratify parameter when calling train_test_split); however, it looks like some of the classes disappear?
The SVM and confusion matrix functions worked well with a toy data set:
from sklearn.datasets import load_wine
data = load_wine()
X = pd.DataFrame(data.data, columns = data.feature_names)
y = pd.DataFrame(data.target)
y = np.array(y)
y = np.ravel(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
svm, y_train_pred, y_test_pred = train_svm_model(X_train, y_train, X_test)
get_svm_model(X_train, X_test, y_train, y_test)
Any idea what is going on here?
Thanks in advance.
The CM code:
def plot_confusion_matrix(y_true, y_pred,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    classes = unique_labels(y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    plt.show()
    return ax
Your confusion matrix is not zero:
The x-axis shows the predicted labels and the y-axis the true labels. Let's look at the third row from the top:
0.94 of the samples with true label Status_62_Substatus_7 are predicted as class Other, which is wrong.
0.00 of the same true label are predicted as the next class, also wrong.
0.00 of the same true label are predicted as the correct class (this is the cell that should be high; higher is better).
0.06 are predicted as the last class, again wrong.
Your problem is so imbalanced that you simply get 0 predictions for two of the labels.
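One detail worth checking in the plotting code above (an observation on the posted code, not part of the original answer): classes is taken from unique_labels(y_pred), so any class the model never predicts also disappears from the axis labels, whereas the commented-out line uses both arrays. A minimal sketch of that variant:

from sklearn.utils.multiclass import unique_labels

# Label the axes with every class present in either the true or the predicted labels,
# so classes that receive zero predictions still appear on the matrix.
classes = unique_labels(y_true, y_pred)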
I'm quite new to the Python ML environment. I need to plot the precision/recall graph; as stated in this post: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html, you need to compute the y_score:
# Create a simple classifier
classifier = svm.LinearSVC(random_state=random_state)
classifier.fit(X_train, y_train)
y_score = classifier.decision_function(X_test)
So the question is: how can I compute the score using Multinomial NaiveBayes or LearningTree? In my code I have:
print("MultinomialNB - countVectorizer")
xTrain, xTest, yTrain, yTest=countVectorizer(db)
classifier = MultinomialNB()
model = classifier.fit(xTrain, yTrain)
yPred = model.predict(xTest)
print("confusion Matrix of MNB/ cVectorizer:\n")
print(confusion_matrix(yTest, yPred))
print("\n")
print("classificationReport Matrix of MNB/ cVectorizer:\n")
print(classification_report(yTest, yPred))
elapsed_time = time.time() - start_time
print("elapsed Time: %.3fs" %elapsed_time)
Plot function:
def plotLearningAlgorithm(yTest, yScore, algName):
    precision, recall, _ = precision_recall_curve(yTest, yScore)

    plt.step(recall, precision, color='b', alpha=0.2,
             where='post')
    plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)

    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('2-class Precision-Recall ' + algName + ' curve: AP={0:0.2f}'.format(average_precision))
Error with plot:
<ipython-input-43-d07c3365bfc2> in MultinomialNaiveBayesOPT()
11 yPred = model.predict(xTest)
12
---> 13 plotLearningAlgorithm(yTest,model.predict_proba(xTest),"MultinomialNB - countVectorizer")
14
15 print("confusion Matrix of MNB/ cVectorizer:\n")
<ipython-input-42-260aac9918f2> in plotLearningAlgorithm(yTest, yScore, algName)
1 def plotLearningAlgorithm(yTest,yScore,algName):
2
----> 3 precision, recall, _ = precision_recall_curve(yTest, yScore)
4
5 step_kwargs = ({'step': 'post'}
/opt/anaconda3/lib/python3.7/site-packages/sklearn/metrics/ranking.py in precision_recall_curve(y_true, probas_pred, pos_label, sample_weight)
522 fps, tps, thresholds = _binary_clf_curve(y_true, probas_pred,
523 pos_label=pos_label,
--> 524 sample_weight=sample_weight)
525
526 precision = tps / (tps + fps)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/metrics/ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
398 check_consistent_length(y_true, y_score, sample_weight)
399 y_true = column_or_1d(y_true)
--> 400 y_score = column_or_1d(y_score)
401 assert_all_finite(y_true)
402 assert_all_finite(y_score)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
758 return np.ravel(y)
759
--> 760 raise ValueError("bad input shape {0}".format(shape))
761
762
ValueError: bad input shape (9000, 2)
Where db contains my dataset already divided between train set and test set.
Any suggestions?
Solution:
def plot_pr(y_pred, y_true, l):
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred, pos_label=l)
    return precision, recall

def plotPrecisionRecall(xTest, yTest, yPred, learningName, model):
    yPred_probability = model.predict_proba(xTest)
    yPred_probability = yPred_probability[:, 1]
    no_skill_probs = [0 for _ in range(len(yTest))]
    ns_precision, ns_recall, _ = precision_recall_curve(yTest, no_skill_probs, pos_label="L")
    precision, rec = plot_pr(yPred_probability, yTest, "L")
    plt.title(learningName)
    plt.plot(ns_recall, ns_precision, linestyle='--', label='No Skill')
    plt.plot(rec, precision, label='Skill')
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.legend()
    plt.show()
So, as it turns out, the y_Pred needed to be transformed with:
yPred_probability = yPred_probability[:,1]
So a big thank you to #ignoring_gravity for providing the right solution; I've also plotted the no-skill line for extra readability of the graph.
What they call y_score is just the predicted probabilities outputted by your ML algorithm.
In multinomial nb and in a decision tree (I suppose that's what you mean by LearningTree?), you can do this with the method .predict_proba:
classifier = MultinomialNB()
model = classifier.fit(xTrain, yTrain)
yPred = model.predict_proba(xTest)
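For a binary problem, precision_recall_curve then expects a single score per sample, so you would keep only the positive-class column, in line with the [:, 1] slice used in the solution above (a short usage sketch; the "L" pos_label is taken from that solution code):

yScore = model.predict_proba(xTest)[:, 1]   # probability of the positive class
# pos_label is needed when the labels are not the default 0/1 or -1/1
precision, recall, thresholds = precision_recall_curve(yTest, yScore, pos_label="L")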
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# CountVectorizer is not used for the train/test split; instead use train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.33, random_state=42)  # here X is your textual data, whereas y is your target

vect = CountVectorizer()
countMatrix_train = vect.fit_transform(train_x)  # you have to fit with your train data
countMatrix_test = vect.transform(test_x)        # now transform (and not fit_transform) according to your train data

classifier = MultinomialNB()
classifier.fit(countMatrix_train, train_y)

ypred = classifier.predict(countMatrix_test)  # this gives you the class for your test data; now use this for the classification report