Context
I use scikit-learn machine learning algorithms like SVR for a regression task.
from sklearn.svm import SVR

model = SVR(kernel='poly', degree=2, epsilon=.5)
model.fit(
    features,  # NumPy array with the features
    target     # NumPy array with the target
)
Afterwards I retrieve the score of the regression using the .score() function.
Additionally, I need the prediction results from .predict() for further processing.
some_data = [...]        # NumPy array with some data to predict
correct_targets = [...]  # NumPy array with the targets corresponding to some_data

# Get R²
print("R²:", model.score(some_data, correct_targets))

# Store prediction
pred = model.predict(some_data)
Question
When I run the code in the version above, the predictions are computed twice, once inside .score() and once for .predict().
However, I cannot run .score() on the stored result of .predict().
This is a bit nasty since the calculation takes some time.
Is it possible to store the prediction and apply .score() afterwards without recalculating?
If you already have the predicted values:
pred = model.predict(some_data)
and the respective ground truth correct_targets, it is straightforward to get the R² score without re-running the model, as scikit-learn has a dedicated function for this:
from sklearn.metrics import r2_score
r2_score(correct_targets, pred)
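Putting it together with the question's code, you can predict once and score on the stored result; a minimal sketch, assuming model, some_data and correct_targets are defined as above:
from sklearn.metrics import r2_score

# Predict once, then reuse the stored predictions for the score
pred = model.predict(some_data)
print("R²:", r2_score(correct_targets, pred))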
Related
I am using scikit-learn and Gaussian Process Regression for a problem, as well as the built-in TransformedTargetRegressor.
The issue I face is that GPR allows predictions to be returned together with their standard deviations. In that case the estimator actually returns a tuple of two NumPy arrays (one for the mean, the other for the std). However, the transformed target regressor only expects a single NumPy array and therefore breaks when the predict method is called with return_std=True.
I have dropped in a really simple example to demonstrate this. It's meant to be representative of an actual problem, hence the inclusion of a pipeline, although with no pre-processing steps. There are also some commented-out lines that demonstrate how the predict method works without the transformed target regressor.
I would like to hear if there is any way around this, short of applying the transformer to the predictions myself manually (a rough sketch of that manual route follows the code below).
#%% Imports
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, DotProduct, RationalQuadratic, Matern)
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import PowerTransformer
#%% Generate Data
X = np.linspace(start=0, stop=10, num=1_000).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
training_indices = rng.choice(np.arange(y.size), size=6, replace=False)
X_train, y_train = X[training_indices], y[training_indices]
#%% Fit Model
kernel = 1 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2))
# Standard Estimator
# estimator = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
# Transformed Estimator
estimator = TransformedTargetRegressor(
    regressor=GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9),
    transformer=PowerTransformer(method='yeo-johnson')
)

pipe = Pipeline(
    steps=[
        ("estimator", estimator)
    ]
)
pipe.fit(X_train, y_train)
#%% Predict
# No parameters - Prediction returns numpy array
# pipe.predict(X)
# Std Parameter - Prediction returns tuple of numpy arrays
mean_prediction, std_prediction = pipe.predict(X, return_std=True)
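As a rough sketch of the manual route mentioned in the question (assuming, as in current scikit-learn, that the fitted TransformedTargetRegressor exposes its fitted inner regressor as regressor_ and its fitted transformer as transformer_), one could call the inner regressor directly and inverse-transform only the mean; the standard deviation stays in the transformed target space, since a non-linear transformer has no simple inverse for it:
# Sketch of a manual workaround; assumes the pipeline above has been fitted and
# has no preprocessing steps, so X can be passed to the inner regressor directly.
ttr = pipe.named_steps["estimator"]

# Predict in the transformed target space, with standard deviations
mean_t, std_t = ttr.regressor_.predict(X, return_std=True)

# Map only the mean back to the original target space
mean_prediction = ttr.transformer_.inverse_transform(mean_t.reshape(-1, 1)).ravel()

# std_t is expressed in the transformed space; inverse-transforming it directly
# would not be meaningful for a non-linear transform such as Yeo-Johnson.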
I want to use sklearn.metrics.confusion_matrix(y_true, y_pred) to create a confusion matrix for a keras model.
After training a model I can use predict_generator(generator) to get predictions for a test dataset, which gives me y_pred. How can I get the corresponding true labels, y_true, from a data generator?
generator.classes will give you the observed values as integer class indices (sparse format). If you need them in dense (i.e., one-hot encoded) format instead, you could get that with:
import pandas as pd

# get_dummies already returns a dense (one-hot encoded) DataFrame
pd.get_dummies(pd.Series(generator.classes))
NOTE though: you must set the generator's shuffle attribute to False before generating the predictions and fetching the observed classes; otherwise your predictions and observations will not line up!
After creating a data generator, either your own or the built-in ImageDataGenerator, use your trained model to make predictions:
true_labels = data_generator.classes
predictions = model.predict_generator(data_generator)
sklearn's confusion matrix expects a 1-d array of labels, so you have to convert your predictions using np.argmax()
y_true = true_labels
y_pred = np.argmax(predictions, axis=1)  # one predicted class index per sample
Then you can use those variables directly in the confusion_matrix function
cm = sklearn.metrics.confusion_matrix(y_true, y_pred)
And you can plot it using the example plot_confusion_matrix() function found here:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
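Putting the pieces together, a minimal end-to-end sketch might look like the following; the directory path, image size, and the trained model are placeholders, and the key point is shuffle=False, as noted above:
import numpy as np
from sklearn.metrics import confusion_matrix
from keras.preprocessing.image import ImageDataGenerator

# Hypothetical test generator; shuffle=False keeps predictions and labels aligned
data_generator = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    "path/to/test_dir",      # placeholder directory
    target_size=(224, 224),  # placeholder image size
    batch_size=32,
    class_mode="categorical",
    shuffle=False,
)

predictions = model.predict_generator(data_generator)  # `model` is your trained Keras model
y_true = data_generator.classes                         # integer class indices
y_pred = np.argmax(predictions, axis=1)                 # collapse scores to class indices

cm = confusion_matrix(y_true, y_pred)
print(cm)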
I am trying to calculate the prediction probabilities. I have written a program that works, but it is very slow and takes a long time on large datasets.
The aim is to calculate each prediction probability in the SVM model by using LinearSVC and OneVsRestClassifier, but I get the error
AttributeError: 'LinearSVC' object has no attribute 'predict_proba'
Because of the above error, I have tried the approach below instead.
Code
from sklearn import svm
from sklearn.preprocessing import LabelEncoder

model_1 = svm.SVC(kernel='linear', probability=True)

# Label-encode the feature column and the target column
X_1 = df["Property Address"]
lb = LabelEncoder()
X_2 = lb.fit_transform(X_1)

y_1 = df["Location_Name"]
y_2 = lb.fit_transform(y_1)

test_1 = test["Property Address"]
lb = LabelEncoder()
test_1 = lb.fit_transform(test_1)

X_2 = X_2.reshape(-1, 1)
y_2 = y_2.reshape(-1, 1)
test_1 = test_1.reshape(-1, 1)

model_1.fit(X_2, y_2)

results = model_1.predict_proba(test_1)[0]

# gets a dictionary of {'class_name': probability}
prob_per_class_dictionary = dict(zip(model_1.classes_, results))
Is there any other way to do the same task? Please suggest one.
You could use sklearn's CalibratedClassifierCV if you need the predict_proba method.
Or you could use logistic regression.
If your issue is related to speed, consider using LinearSVC from sklearn.svm instead of SVC(kernel='linear'); it is faster.
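A minimal sketch of combining the two suggestions (wrapping LinearSVC in CalibratedClassifierCV to get predict_proba; X_2, y_2 and test_1 are assumed to be prepared as in the question):
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# LinearSVC itself has no predict_proba, but the calibration wrapper adds one
base_clf = LinearSVC()
calibrated_clf = CalibratedClassifierCV(base_clf, cv=5)

calibrated_clf.fit(X_2, y_2.ravel())          # X_2, y_2 as prepared in the question
probs = calibrated_clf.predict_proba(test_1)  # one row of probabilities per test sample

prob_per_class_dictionary = dict(zip(calibrated_clf.classes_, probs[0]))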
As suggested in another answer, LinearSVC is faster than SVC(kernel='linear').
Regarding probability, SVC does not expose predict_proba() by default; you have to set its probability hyperparameter to True when constructing it.
Tip: SVMs are preferred for small datasets, so prefer other algorithms for handling large datasets.
I am taking dataquest.io and I observed something strange (but could not get any answer back there). I am wondering why I can't reuse a code snippet that worked before in a situation that uses the same kind of data and should not cause an exception.
The lesson first teaches fitting a regressor on a training set, predicting on the same values, and then calculating the MSE.
Then it shows that this would overfit and proposes a random train/test split to avoid it. The problem is that, apart from the random splitting, the dataframes generated are very similar, yet if I try to calculate my MSE on the final results in the same way, it fails, and I have to change the code to an alternative.
Here are both versions of the code:
First code
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Initialize the linear regression class.
regressor = LinearRegression()
# We're using 'value' as a predictor, and making predictions for 'next_day'.
# The predictors need to be in a dataframe.
# We pass in a list when we select predictor columns from "sp500" to
# force pandas not to generate a series.
# (?) I could not figure out why it is not necessary for "to_predict"
predictors = sp500[["value"]]
to_predict = sp500["next_day"]
# Train the linear regression model on our dataset.
regressor.fit(predictors, to_predict)
# Generate a list of predictions with our trained linear regression model
next_day_predictions = regressor.predict(predictors)
print(next_day_predictions)
MSE_frame = (next_day_predictions - to_predict) ** 2
# (?) can math.pow(frame_difference, 2) be used on a dataframe?
mse = MSE_frame.sum() / len(MSE_frame.index)
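For reference, scikit-learn also ships a ready-made helper for this last step; a minimal sketch, assuming next_day_predictions and to_predict are defined as above:
from sklearn.metrics import mean_squared_error

# Equivalent to summing the squared differences and dividing by the count
mse = mean_squared_error(to_predict, next_day_predictions)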
______________________________________________________________________________
Second code
import numpy as np
import random
# Set a random seed to make the shuffle deterministic.
np.random.seed(1)
random.seed(1)
#(?) are there any difference between both of these statements? Are they
# both necessary or just one out of two?
# Randomly shuffle the rows in our dataframe
sp500 = sp500.loc[np.random.permutation(sp500.index)]
# Select 70% of the dataset to be training data
highest_train_row = int(sp500.shape[0] * .7)
train = sp500.loc[:highest_train_row,:]
# Select 30% of the dataset to be test data.
test = sp500.loc[highest_train_row:,:]
regressor = LinearRegression()
regressor.fit(train[["value"]], train["next_day"])
predictions = regressor.predict(test[["value"]])
mse = sum((predictions - test["next_day"]) ** 2) / len(predictions)
regressor = LinearRegression()
predictors = train[["value"]]
to_predict = train["next_day"]
# Train the linear regression model on our dataset.
regressor.fit(predictors, to_predict)
# Generate a list of predictions with our trained linear regression model
next_day_predictions = regressor.predict(test[["value"]])
print(next_day_predictions)
sqr = (next_day_predictions - test["next_day"]) ** 2
# Mistake was here: I was originally passing a DataFrame with test[["next_day"]],
# which was not done in the first code. Stupid me.
mse = sum(sqr) / len(sqr.index)
# or
mse = sqr.sum() / len(sqr.index)
# This is the line which failed, although it was identical to what was
# done before.
** It is worth noting that the two mse expressions don't yield exactly the same results: they are identical for the first ten decimals, but comparison with == doesn't give True. (The built-in sum() and pandas' .sum() accumulate the values in different orders and precisions, so tiny floating-point differences are expected.)
So, the problem was on this line:
sqr = (next_day_predictions - test["next_day"]) ** 2
I originally wrote
sqr = (next_day_predictions - test[["next_day"]]) ** 2
thus passing a DataFrame into the calculation, which was not done in the first code.
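A tiny sketch of the difference, using a hypothetical miniature frame standing in for test: test["next_day"] is a Series, so subtracting it from the prediction array works element-wise, whereas test[["next_day"]] is a single-column DataFrame with a different shape:
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for `test`
test = pd.DataFrame({"next_day": [1.0, 2.0, 3.0]})
preds = np.array([1.1, 1.9, 3.2])

series_diff = preds - test["next_day"]   # Series: element-wise, shape (3,)
frame = test[["next_day"]]               # single-column DataFrame, shape (3, 1)
print(series_diff.shape, frame.shape)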
I plan on using scikit-learn's SVM for class prediction.
I have a two-class dataset consisting of about 100 experiments. Each experiment encapsulates my data points (vectors) plus their classification.
Training an SVM according to http://scikit-learn.org/stable/modules/svm.html should be straightforward.
I will have to put all vectors in an array, generate another array with the corresponding class labels, and train the SVM. However, in order to run leave-one-out error estimation, I need to leave out a specific subset of vectors, one experiment at a time.
How do I achieve that with the available score function?
Cheers,
EL
You could manually train on everything but the one observation, using numpy indexing to drop it out. Then you can use any of sklearn's helpers to evaluate the classification. For example:
import numpy as np
from sklearn import svm

clf = svm.SVC(...)

idx = np.arange(len(observations))
preds = np.zeros(len(observations))
for i in idx:
    is_train = idx != i  # boolean mask: everything except observation i
    clf.fit(observations[is_train, :], labels[is_train])
    preds[i] = clf.predict(observations[i:i + 1, :])[0]  # predict expects a 2-D array
Alternatively, scikit-learn has a helper to do leave-one-out, and another helper to get cross-validation scores:
from sklearn import svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

clf = svm.SVC(...)
# In current scikit-learn, LeaveOneOut lives in sklearn.model_selection
# and no longer takes the number of observations as an argument.
loo = LeaveOneOut()
was_right = cross_val_score(clf, observations, labels, cv=loo)
total_acc = np.mean(was_right)
See the user's guide for more. cross_val_score actually returns a score for each fold (which is a little strange IMO), but since we have one fold per observation, this will just be 0 if it was wrong and 1 if it was right.
Of course, leave-one-out is very slow and has terrible statistical properties to boot, so you should probably use KFold instead.
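For completeness, a minimal sketch of that KFold alternative, assuming clf, observations and labels are defined as above:
from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross-validation instead of leave-one-out: far fewer fits, and each
# fold's score is an accuracy computed over many held-out observations.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, observations, labels, cv=kf)
print(scores.mean(), scores.std())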