I am playing with the Leaf Classification data set and I am struggling to compute the log loss of my model after testing it. After importing it from the metrics class here I do:
# fitting the knn with train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Optimisation via gridSearch
knn=KNeighborsClassifier()
params={'n_neighbors': range(1,40), 'weights':['uniform', 'distance'], 'metric':['minkowski','euclidean'],'algorithm': ['auto','ball_tree','kd_tree', 'brute']}
k_grd=GridSearchCV(estimator=knn,param_grid=params,cv=5)
k_grd.fit(X_train,y_train)
# testing
yk_grd=k_grd.predict(X_test)
# calculating the logloss
print (log_loss(y_test, yk_grd))
However, my last line is resulting in the following error:
y_true and y_pred contain different number of classes 93, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true.
But the when I run the following:
X_train.shape, X_test.shape, y_train.shape, y_test.shape, yk_grd.shape
# results
((742, 192), (248, 192), (742,), (248,), (248,))
what am i missing really?
From sklearn.metrics.log_loss documentantion:
y_pred : array-like of float, shape = (n_samples, n_classes) or
(n_samples,)
Predicted probabilities, as returned by a classifier’s predict_proba method.
Then, to get log loss:
yk_grd_probs = k_grd.predict_proba(X_test)
print(log_loss(y_test, yk_grd_probs))
If you still get an error, it means that a specific class is missing in y_test.
Use:
print(log_loss(y_test, yk_grd_probs, labels=all_classes))
where all_classes is a list containing all the classes in your dataset.
Related
I'm a little confused about splitting the dataset when I'm making and evaluating Keras machine learning models.
Lets say that I have dataset of 1000 rows.
features = df.iloc[:,:-1]
results = df.iloc[:,-1]
Now I want to split this data into training and testing (33% of data for testing, 67% for training):
x_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
I have read on the internet that fitting the data into model should look like this:
history = model.fit(features, results, validation_split = 0.2, epochs = 10, batch_size=50)
So I'm fitting the full data (features and results) to my model, and from that data I'm using 20% of data for validation: validation_split = 0.2.
So basically, my model will be trained with 80% of data, and tested on 20% of data.
So confusion starts when I need to evaluate the model:
score = model.evaluate(x_test, y_test, batch_size=50)
Is this correct?
I mean, why should I split the data into training and testing, where does x_train and y_train go?
Can you please explain to me whats the correct order of steps for creating model?
Generally, in training time (model. fit), you have two sets: one is for the training set and another is for validation/tuning/development set. With the training set, you train the model, and with the validation set, you need to find the best set of hyper-parameter. And when you're done, you may then test your model with unseen data set - a set that was completely hidden from the model unlike the training or validation set.
Now, when you used
X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
By this, you split the features and results into 33% of data for testing, 67% for training. Now, you can do two things
use the (X_test and y_test as validation set in model.fit(...). Or,
use them for final prediction in model. predict(...)
So, if you choose these test sets as a validation set ( number 1 ), you would do as follows:
model.fit(x=X_train, y=y_trian,
validation_data = (X_test, y_test), ...)
In the training log, you will get the validation results along with the training score. The validation results should be the same if you later compute model.evaluate(X_test, y_test).
Now, if you choose those test set as a final prediction or final evaluation set ( number 2 ), then you need to make validation set newly or use the validation_split argument as follows:
model.fit(x=X_train, y=y_trian,
validation_split = 0.2, ...)
The Keras API will take the .2 percentage of the training data (X_train and y_train) and use it for validation. And lastly, for the final evaluation of your model, you can do as follows:
y_pred = model.predict(x_test, batch_size=50)
Now, you can compare with y_test and y_pred with some relevant metrics.
Generally, you'd want to use your X_train, y_train data that you have split as arguments in the fit method. So it would look something like:
history = model.fit(X_train, y_train, batch_size=50)
While not splitting your data before throwing it into the fit method and adding the validation_split arguments work as well, just be careful to refer to the keras documentation on the validation_data and validation_split arguments to make sure that you are splitting them up as expected.
There is a related question here:
https://datascience.stackexchange.com/questions/38955/how-does-the-validation-split-parameter-of-keras-fit-function-work
Keras documentation:
https://keras.rstudio.com/reference/fit.html
I have read on the internet that fitting the data into model should
look like this:
That means you need to fit features and labels. You already split them into x_train & y_train. So your fit should look like this:
history = model.fit(x_train, y_train, validation_split = 0.2, epochs = 10, batch_size=50)
So confusion starts when I need to evaluate the model:
score = model.evaluate(x_test, y_test, batch_size=50) --> Is this correct?
That's correct, you evaluate the model by using testing features and corresponding labels. Furthermore if you want to get only for example predicted labels, you can use:
y_hat = model.predict(X_test)
Then you can compare y_hat with y_test, i.e get a confusion matrix etc.
I have the following experimental setup for a regression problem.
Using the following routine, a data set of about 1800 entries is separated into three groups, validation, test, and training.
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.2,
random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
random_state=42, shuffle=True)
So in essence, training size ~ 1100, validation and test size ~ 350, and each subset is then having unique set of data points, that which is not seen in the other subsets.
With these subsets, I can preform a fitting using any number of the regression models available from scikit-learn, using the following routine:
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
Doing this I then calculate the RMSE of the predictions, which in the case of the linear regressor, is about ~ 0.948.
Now, I could instead use cross-validation and not worry about splitting the data instead, using the following routine:
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions2 = cross_val_predict(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))
However, when I calculate the RMSE of these predictions, it is about ~2.4! To compare, I tried using a similar routine, but switched X for X_train, and y for y_train, i.e.,
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions3 = cross_val_predict(clf, X_train, y_train, cv=KFold(n_splits=10, shuffle=True, random_state=42))
and received a RMSE of about ~ 0.956.
I really do not understand why that when using the entire data set, the RMSE for the cross-validation is so much higher, and that the predictions are terrible in comparison to that with reduced data set.
Additional Notes
Additionally, I have tried out running the above routine, this time using the reduced subset X_val, y_val as inputs for the cross validation, and still receive small RMSE. Additionally, when I simply fit a model on the reduced subset X_val, y_val, and then make predictions on X_train, y_train, the RMSE is still better (lower) than that of the cross-validation RMSE!
This does not only happen for LinearRegressor, but also for RandomForrestRegressor, and others. I have additionally tried to change the random state in the splitting, as well as completely shuffling the data around before handing it to the train_test_split, but still, the same outcome occurs.
Edit 1.)
I tested out this on a make_regression data set from scikit and did not get the same results, but rather all the RMSE are small and similar. My guess is that is has to do with my data set.
If anyone could help me out in understanding this, I would greatly appreciate it.
Edit 2.)
Hi thank you (#desertnaut) for the suggestions, the solution was actually quite easy, and the fact was that in my routine to process the data, I was using (targets, inputs) = (X, y), which is really wrong. I swapped that with (targets, inputs) = (y, X), and now the RMSE is about the same as the other profiles. I made a histogram profile of the data and found that problem. Thanks! I'll save the question for about 1 hour, then delete it.
You're overfitting.
Imagine you had 10 data points and 10 parameters, then RMSE would be zero because the model could perfectly fit the data, now increase the data points to 100 and the RMSE will increase (assuming there is some variance in the data you are adding of course) because your model is not perfectly fitting the data anymore.
RMSE being low (or R-squared high) more often than not doesn't mean jack, you need to consider the standard errors of your parameter estimates . . . If you are just increasing the number of parameters (or conversely, in your case, decreasing the number of observations) you are just chewing away your degrees of freedom.
I'd wager that your standard error estimates for the X model's parameter estimates are smaller than your standard error estimates in the X_train model, even though RMSE is "lower" in the X_train model.
Edit: I'll add that your dataset exhibits high multicollinearity.
In the code below (last line) it is used the X_test and y_test which according to the docs:
Returns the mean accuracy on the given test data and label
The question is what exactly is calculated since X_test has the data from the test data and the y_test has the labels for these data.
It would make sense to check the predicted labels vs the actual labels.
Can you tell me how the first scenario works in the last line?
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'],
random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
If you check the docs:
Parameters:
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns:
score : float
Mean accuracy of self.predict(X) wrt. y.
What happens in the back?
It is predicting: knn.predict(X_test) and then using these values to calculate the mean accuracy between knn.predict(X_test) and y_test.
You should get the same output with sklearn.metrics.accuracy_score:
from sklearn.metrics import accuracy_score
y_predict = knn.predict(X_test)
accuracy_score(y_test, y_predict)
I am using LinearRegression(). Below you can see what I have already done to predict new features:
lm = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=say)
lm.fit(X_train, y_train)
lm.predict(X_test)
scr = lm.score(X_test, y_test)
lm.fit(X, y)
pred = lm.predict(X_real)
Do I really need the line lm.fit(X, y) or can I just go without using it? Also, If I don't need to calculate accuracy, do you think the following approach is better instead using training and testing? (In case I don't want to test):
lm.fit(X, y)
pred = lm.predict(X_real)
Even I am getting 0.997 accuraccy, the predicted value is not close or shifted. Are there ways to make prediction more accurate?
You don't need to fit multiple times for predicting a value by given features since your algorithm already learned your train set. Check the codes below.
# Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
# Teach your data to your algorithm with train set
lr = LinearRegression()
lr.fit(X_train, y_train)
# Now it can predict
y_pred = lr.predict(X_test)
# Use test set to see how accurate it predicts
lr_score = lr.score(y_pred, y_test)
The reason you are getting almost 100% accuracy score is a data leakage, caused by the following line of code:
lm.fit(X, y)
in the line above you gave your model ALL the data and then you are testing prediction using the subset of data that your model has already seen.
This causes very high accuracy score for the already seen data, but usually it performs badly on the unseen data.
When do you want / need to fit your model multiple times?
If you are getting a new training data and want to improve your model by training it against a new portion of data, then you may want to choose one of regression algorithm, supporting incremental-learning.
In this case you will use model.partial_fit() method instead of model.fit()...
I am trying to learn Keras GraphNN using simple examples. I have a simple example dataset with 784 features and I want to run this example:
# graph model with one input and two outputs
graph = Graph()
graph.add_input(name='input', input_shape=(784,))
graph.add_node(Dense(input_dim=784, output_dim=13), name='dense1', input='input')
graph.add_node(Dense(input_dim=13, output_dim=1), name='dense2', input='input')
graph.add_node(Dense(input_dim=13, output_dim=1), name='dense3', input='dense1')
graph.add_output(name='output1', input='dense2')
graph.add_output(name='output2', input='dense3')
graph.compile('rmsprop', {'output1': 'mse', 'output2': 'mse'})
graph.fit({'input': X_train, 'output1': y_train, 'output2': y_train}, nb_epoch=30)
### here is where I am facing difficulty
score = graph.evaluate({'input': X_test, 'output1': y_test, 'output2': y_test}, batch_size=16, verbose=1)
print 'score: ', score
The documentation mentions that graph.evaluate():
evaluate(data, batch_size=128, verbose=1): Show performance of the model over some validation data.
Return: The loss score over the data.
Arguments: Same meaning as fit method above. verbose is used as a binary flag (progress bar or nothing).
And from the definition of the graph.fit() we know that:
Arguments:
data:dictionary mapping input names out outputs names to appropriate numpy arrays. All arrays should contain the same number of samples.
Although my fit method runs perfect, I get IndexError: index 1 is out of bounds for size 1 on evaluate
My input shapes are:
Xtrain: (32738, 784)
Xtest: (16125, 784)
ytest: (16125,)
ytrain: (32738,)
What am I missing here ?