Explanation of the parameters of knn.score - python

In the code below (last line), X_test and y_test are passed to knn.score, which according to the docs:
Returns the mean accuracy on the given test data and labels.
The question is what exactly is calculated, since X_test holds the test samples and y_test holds the labels for those samples.
It would make sense to check the predicted labels against the actual labels.
Can you tell me how this works in the last line?
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'],
                                                    random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

If you check the docs:
Parameters:
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns:
score : float
Mean accuracy of self.predict(X) wrt. y.
What happens under the hood?
It calls knn.predict(X_test) and then computes the mean accuracy between those predictions and y_test.
You should get the same output with sklearn.metrics.accuracy_score:
from sklearn.metrics import accuracy_score
y_predict = knn.predict(X_test)
accuracy_score(y_test, y_predict)
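Since the mean accuracy is simply the fraction of predictions that match the true labels, the following minimal sketch (using NumPy, which is not part of the original snippet) should give the same number:
import numpy as np

y_predict = knn.predict(X_test)
manual_accuracy = np.mean(y_predict == y_test)  # fraction of predictions equal to the true labels
print(manual_accuracy)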

Related

"Expected sequence or array-like, got DecisionTreeClassifier" from sklearn

x_train,x_test,y_train,y_test=train_test_split(X,y,random_state=22)
model=DecisionTreeClassifier()
y_pred=model.fit(x_train,y_train)
print("precision: ",precision_score(y_test,y_pred))
Output:
TypeError: Expected sequence or array-like, got <class 'sklearn.tree._classes.DecisionTreeClassifier'>
model.fit() does not return the predictions. It fits the model and returns the model itself. I guess what you want is something like this instead:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("precision: ", precision_score(y_test, y_pred))
I've followed the convention to capitalize X, because it is a 2-dimensional matrix, while y is a one-dimensional vector.
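As a side note, since fit returns the fitted estimator itself, the two steps could also be chained in a single line if you prefer (a purely stylistic variation, not required):
y_pred = DecisionTreeClassifier().fit(X_train, y_train).predict(X_test)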

What KNN accuracy is shown here?

I use the KNN algorithm from the sklearn library. In the last code line, I get an accuracy of 95.5%. Does that mean that 95.5% of my test dataset X_test is predicted correctly?
Here is a short part of my script.
# Model, predict and solve
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
# KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
How can I see the wrong predicted rows?
Thank you!
Your predictions are in the Y_pred variable. Print it along with the y_test variable. You can then compare and see which observations are correctly predicted and which are incorrectly classified.
PS: From your code snippet, it's not clear whether you actually have a y_test variable.
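Assuming you do have the true labels for X_test in a y_test variable (see the PS above), a minimal sketch for inspecting the misclassified rows could look like this:
import pandas as pd

# hypothetical y_test: true labels for X_test, assumed to share X_test's row order/index
wrong_mask = Y_pred != y_test                                   # True where the prediction is wrong
comparison = pd.DataFrame({"actual": y_test, "predicted": Y_pred})
print(comparison[wrong_mask])                                   # actual vs. predicted for the misses
print(X_test[wrong_mask])                                       # the corresponding feature rows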

DecisionTreeRegressor score not calculated

I'm trying to calculate the score of a DecisionTreeRegressor with the following code:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn import tree
# train_df is the Kaggle house-prices training DataFrame (see the dataset link below)
# some features, such as HouseStyle, would be better handled with LabelEncoder, but the chance that they
# affect the target LotFrontage is small, so we just one-hot encode and drop unwanted columns later
encoded_df = pd.get_dummies(train_df, prefix_sep="_", columns=['MSZoning', 'Street', 'Alley',
'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood',
'Condition1', 'Condition2', 'BldgType', 'HouseStyle'])
encoded_df = encoded_df[['LotFrontage', 'LotArea', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3',
'LotConfig_Corner', 'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3', 'LotConfig_Inside']]
# impute LotFrontage with the mean value (we saw a low outlier ratio, so we use the mean)
encoded_df['LotFrontage'].fillna(encoded_df['LotFrontage'].mean(), inplace=True)
X = encoded_df.drop('LotFrontage', axis=1)
y = encoded_df['LotFrontage'].astype('int32')
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = DecisionTreeRegressor()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_test = y_test.values.reshape(-1, 1)
classifier.score(y_test, y_pred)
print("Accuracy is: ", accuracy_score(y_test, y_pred) * 100)
When it gets to calculating the score of the model, I get the following error:
ValueError: Number of features of the model must match the input. Model n_features is 9 and input n_features is 1
Not sure why this happens, because according to the sklearn docs the test samples are to be in the shape of (n_samples, n_features), and y_test is indeed in this shape:
y_test.shape # (365, 1)
and the True labels should be in the shape of (n_samples) or (n_samples, n_outputs) and y_pred is indeed in this shape:
y_pred.shape # (365,)
The dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
The first argument of the score function shouldn't be the target values of the test set; it should be the input features of the test set, so you should do:
classifier.score(X_test, y_test)
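For context, a minimal sketch of the corrected evaluation with the same variables as above. Note that DecisionTreeRegressor.score returns the R² coefficient of determination, and accuracy_score is not intended for regression targets, so an explicit regression metric such as mean absolute error may be more informative:
from sklearn.metrics import mean_absolute_error

r2 = classifier.score(X_test, y_test)                          # R^2, equivalent to r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, classifier.predict(X_test))
print("R^2: {:.3f}, MAE: {:.2f}".format(r2, mae))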

How to get the log loss?

I am playing with the Leaf Classification data set and I am struggling to compute the log loss of my model after testing it. After importing log_loss from the metrics module, I do:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss

# fitting the knn with train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Optimisation via gridSearch
knn=KNeighborsClassifier()
params={'n_neighbors': range(1,40), 'weights':['uniform', 'distance'], 'metric':['minkowski','euclidean'],'algorithm': ['auto','ball_tree','kd_tree', 'brute']}
k_grd=GridSearchCV(estimator=knn,param_grid=params,cv=5)
k_grd.fit(X_train,y_train)
# testing
yk_grd=k_grd.predict(X_test)
# calculating the logloss
print (log_loss(y_test, yk_grd))
However, my last line is resulting in the following error:
y_true and y_pred contain different number of classes 93, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true.
But when I run the following:
X_train.shape, X_test.shape, y_train.shape, y_test.shape, yk_grd.shape
# results
((742, 192), (248, 192), (742,), (248,), (248,))
What am I missing?
From the sklearn.metrics.log_loss documentation:
y_pred : array-like of float, shape = (n_samples, n_classes) or
(n_samples,)
Predicted probabilities, as returned by a classifier’s predict_proba method.
Then, to get log loss:
yk_grd_probs = k_grd.predict_proba(X_test)
print(log_loss(y_test, yk_grd_probs))
If you still get an error, it means that a specific class is missing in y_test.
Use:
print(log_loss(y_test, yk_grd_probs, labels=all_classes))
where all_classes is a list containing all the classes in your dataset.
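A minimal sketch of one way to build all_classes: the classes known to the fitted model are exposed as k_grd.classes_ (equivalently, np.unique(y) if every class occurs in the training split):
import numpy as np

all_classes = k_grd.classes_                     # classes seen by the fitted estimator
# all_classes = np.unique(y)                     # alternative: every class in the full label vector y

yk_grd_probs = k_grd.predict_proba(X_test)
print(log_loss(y_test, yk_grd_probs, labels=all_classes))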

Scaling of validation data in supervised ML algorithm

I have written a classification algorithm in Python. It conforms to Scikit-Learn's API. Given labeled data X, y, I would like to train my algorithm on this data in the following way:
X, y are split into X_aux, y_aux and X_test, y_test.
X_aux, y_aux are split into X_train, y_train and X_val, y_val.
Then, using Scikit-Learn, I define a Pipeline which is the concatenation of a StandardScaler (for feature normalization) and my model. Finally, the pipeline is trained and evaluated as follows:
pipe = Pipeline([('scaler', StandardScaler()), ('clf', Model())])
pipe.fit(X_train, y_train, validation_data = (X_val, y_val))
pred_proba = pipe.predict_proba(X_test)
score = roc_auc_score(y_test, pred_proba)
The fit method of Model accepts a validation_data parameter to monitor progress during training and possibly avoid overfitting. To this end, at each iteration, the fit method prints the model loss on the training data (X_train, y_train) (training loss) and the model loss on the validation data (X_val, y_val) (validation loss). In addition to the validation loss, I would also like the fit method to return the ROC AUC score on the validation data. My question is the following:
Should X_val be normalized with the scaler of the pipeline before it is used to compute the validation ROC AUC score during training? Also, in this code, only X_train is normalized by the scaler. Should I do X_aux = scaler.fit_transform(X_aux) instead and then split into train/validation?
I apologize in advance; my question is very naive. I confess I got confused.
I think that X_val should be normalized. The way I see it is that the few lines of code above are equivalent to:
scaler = StandardScaler()
clf = Model()
X_train = scaler.fit_transform(X_train)
clf.fit(X_train, y_train, validation_data = (X_val, y_val))
# During `fit`, at each iteration, we would have:
# train_loss = loss(X_train, y_train)
# validation_loss = loss(X_val, y_val)
# pred_proba_val = predict_proba(X_val) (*)
# roc_auc_val = roc_auc_score(y_val, pred_proba_val)
X_test = scaler.transform(X_test)
pred_proba = clf.predict_proba(X_test)  # (**)
score = roc_auc_score(y_test, pred_proba)
In line (*) the predict_proba method is called on unnormalized data whereas it is called on normalized data on line (**). This is why I believe that X_val should be normalized. Still I am not sure whether my thinking is correct.
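For what it's worth, a minimal sketch of the explicit version in which the validation data is scaled with the scaler fitted on the training data only. Model is the asker's custom classifier, so the validation_data keyword is an assumption carried over from the question:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit the scaler on the training data only
X_val_s = scaler.transform(X_val)           # reuse the training statistics for the validation data
X_test_s = scaler.transform(X_test)         # and for the test data

clf = Model()                               # Model: the asker's custom classifier (hypothetical API)
clf.fit(X_train_s, y_train, validation_data=(X_val_s, y_val))

pred_proba = clf.predict_proba(X_test_s)    # (for a binary target, pass pred_proba[:, 1] to roc_auc_score)
score = roc_auc_score(y_test, pred_proba)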
