I am trying to use the GaussianProcessRegressor in scikit-learn with graph kernels computed by the grakel library. Below is my code for a 5-fold cross-validation on 100 graphs. For the sake of testing convenience, I have commented out all graph-related lines and use random kernel matrices and y values instead.
from sklearn.model_selection import KFold
from sklearn.utils import check_random_state
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.metrics import mean_squared_error
#from grakel.kernels import WeisfeilerLehman
import numpy as np
def Kfold_CV_GPR(Gs, y, n_iter=4, n_splits=5, random_state=None):
    random_state = check_random_state(random_state)
    kf = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    errors = []
    for train_idxs, test_idxs in kf.split(y):
        # gk = WeisfeilerLehman(n_iter=n_iter, normalize=True)
        # K_train = gk.fit_transform(Gs[train_idxs])
        # K_test = gk.transform(Gs[test_idxs])
        K_train = np.random.randn(80, 80)
        K_test = np.random.randn(20, 80)
        gpr = GPR(kernel='precomputed')
        gpr.fit(K_train, y[train_idxs])
        y_pred = gpr.predict(K_test)
        rmse = mean_squared_error(y[test_idxs], y_pred, squared=False)
        errors.append(rmse)
    return -np.mean(errors)
score = Kfold_CV_GPR(Gs=None, y=np.random.randn(100, ), n_iter=4, n_splits=5)
print(score)
However, I am getting the following error:
TypeError: Cannot clone object ''precomputed'' (type <class 'str'>): it does not seem to be a scikit-learn
estimator as it does not implement a 'get_params' method.
When I change sklearn.gaussian_process.GaussianProcessRegressor to sklearn.svm.SVR (support vector regression), my code doesn't throw any error, but it runs forever for some reason. I also tested classifiers like sklearn.svm.SVC, and my code works fine.
Does anyone know how to use a precomputed kernel with scikit-learn's GaussianProcessRegressor?
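For reference, one possible workaround (a sketch, not an official scikit-learn feature): GaussianProcessRegressor has no kernel='precomputed' option; its kernel argument must be a kernels.Kernel instance. A small custom kernel can instead treat each sample as an index into a precomputed Gram matrix. PrecomputedKernel below is a hypothetical helper, and it assumes the Gram matrix is symmetric positive semi-definite (a normalized Weisfeiler-Lehman kernel is; the random test matrices above are not).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.gaussian_process.kernels import Kernel

class PrecomputedKernel(Kernel):
    """Looks kernel values up in a precomputed Gram matrix; X holds indices."""

    def __init__(self, gram):
        self.gram = gram

    def __call__(self, X, Y=None, eval_gradient=False):
        ix = np.asarray(X, dtype=int).ravel()
        iy = ix if Y is None else np.asarray(Y, dtype=int).ravel()
        K = self.gram[np.ix_(ix, iy)]
        if eval_gradient:
            # no hyperparameters, so the gradient is empty
            return K, np.empty((K.shape[0], K.shape[1], 0))
        return K

    def diag(self, X):
        ix = np.asarray(X, dtype=int).ravel()
        return self.gram[ix, ix]

    def is_stationary(self):
        return False

# usage: fit and predict on column vectors of indices into the full Gram matrix,
# e.g. K_full = WeisfeilerLehman(n_iter=4, normalize=True).fit_transform(Gs)
K_full = np.eye(100)            # stand-in PSD matrix for illustration
y = np.random.randn(100)
train_idxs, test_idxs = np.arange(80), np.arange(80, 100)
gpr = GPR(kernel=PrecomputedKernel(K_full))
gpr.fit(train_idxs.reshape(-1, 1), y[train_idxs])
y_pred = gpr.predict(test_idxs.reshape(-1, 1))
Since this kernel exposes no hyperparameters, the regressor skips hyperparameter optimization entirely; any tuning (e.g. of n_iter) has to happen outside, as in the cross-validation loop above.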
I am using scikit-learn's Gaussian process regression for a problem, together with the built-in TransformedTargetRegressor.
The issue I face is that GPR allows predictions to be returned with their standard deviations: in that case the estimator returns a tuple of two numpy arrays (one for the mean, the other for the std). However, TransformedTargetRegressor only expects a single numpy array, and it therefore breaks when the predict method is called with return_std=True.
I have dropped in a really simple example to demonstrate this. It's meant to be representative of an actual problem, hence the inclusion of a pipeline, albeit one with no pre-processing steps. There are also some lines commented out that would demonstrate how the predict method works without the transformed target regressor.
I would like to hear if there is any way around this, short of applying the transformer to the predictions manually myself.
#%% Imports
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, DotProduct, RationalQuadratic, Matern)
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import PowerTransformer
#%% Generate Data
X = np.linspace(start=0, stop=10, num=1_000).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
training_indices = rng.choice(np.arange(y.size), size=6, replace=False)
X_train, y_train = X[training_indices], y[training_indices]
#%% Fit Model
kernel = 1 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2))
# Standard Estimator
# estimator = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
# Transformed Estimator
estimator = TransformedTargetRegressor(
    regressor=GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9),
    transformer=PowerTransformer(method='yeo-johnson')
)
pipe = Pipeline(
    steps=[
        ("estimator", estimator)
    ]
)
pipe.fit(X_train, y_train)
#%% Predict
# No parameters - Prediction returns numpy array
# pipe.predict(X)
# Std Parameter - Prediction returns tuple of numpy arrays
mean_prediction, std_prediction = pipe.predict(X, return_std=True)
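One possible way around this, short of patching scikit-learn (a sketch building on the snippet above; the names ttr, mean_t, and std_t are mine): bypass TransformedTargetRegressor.predict, call the wrapped GPR directly, and inverse-transform the results yourself. Note that PowerTransformer is nonlinear, so the standard deviation has no exact inverse image; mapping mean ± std back gives an approximate (asymmetric) interval rather than a true standard deviation.
ttr = pipe.named_steps["estimator"]            # the fitted TransformedTargetRegressor
mean_t, std_t = ttr.regressor_.predict(X, return_std=True)   # transformed space

inv = ttr.transformer_.inverse_transform       # expects 2D arrays
mean_prediction = inv(mean_t.reshape(-1, 1)).ravel()
lower_approx = inv((mean_t - std_t).reshape(-1, 1)).ravel()
upper_approx = inv((mean_t + std_t).reshape(-1, 1)).ravel()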
I am trying to manually predict a logistic regression model using the coefficient and intercept outputs from a scikit-learn model. However, I can't match up my probability predictions with the predict_proba method from the classifier.
I have tried:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from scipy.special import expit
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)
# use sklearn's predict_proba function
sk_probas = clf.predict_proba(X[:1, :])
# and attempting manually (using scipy's inverse logit)
manual_probas = expit(np.dot(X[:1], clf.coef_.T)+clf.intercept_)
# with a completely manual inverse logit
full_manual_probas = 1/(1+np.exp(-(np.dot(X[:1], clf.coef_.T)+clf.intercept_)))
outputs:
>>> sk_probas
array([[9.81815067e-01, 1.81849190e-02, 1.44120963e-08]])
>>> manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
>>> full_manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
I do seem to get the classes to match (using np.argmax), but the probabilities are different. What am I missing?
I've looked at this and this but haven't managed to figure it out yet.
The documentation states that
For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class
That is, in order to get the same values as sklearn you have to normalize using softmax, like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X, y)
decision = np.dot(X[:1], clf.coef_.T)+clf.intercept_
print(clf.predict_proba(X[:1]))
print(np.exp(decision) / np.exp(decision).sum())
To use sigmoids instead you can do it like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000, multi_class='ovr').fit(X, y) # Notice the extra argument
full_manual_probas = 1/(1+np.exp(-(np.dot(X[:1], clf.coef_.T)+clf.intercept_)))
print(clf.predict_proba(X[:1]))
print(full_manual_probas / full_manual_probas.sum())
I am using the shap library for ML interpretability, to better understand the clusters produced by a k-means segmentation. In a nutshell, I make some blobs, use k-means to cluster them, and then take the cluster assignments as labels and use xgboost to try to predict them. I have 5 clusters, so it is a single-label multi-class classification problem.
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap
X, y = make_blobs(n_samples=500, centers=5, n_features=5, random_state=0)
data = pd.DataFrame(np.concatenate((X, y.reshape(500,1)), axis=1), columns=['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'cluster_id'])
data['cluster_id'] = data['cluster_id'].astype(int).astype(str)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.iloc[:,:-1])
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(scaled_features)
data['predicted_cluster_id'] = kmeans.labels_.astype(int).astype(str)
clf = xgb.XGBClassifier()
clf.fit(scaled_features, kmeans.labels_)  # integer labels; recent xgboost versions reject strings
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(scaled_features[0].reshape(1, -1))
shap.force_plot(explainer.expected_value[0], shap_values[0], link='logit') # repeat changing 0 for i in range(0, 5)
The resulting force plots make sense, as the class is '3'. But why this base_value; shouldn't it be 1/5? I asked a similar question a while ago, but this time I have already set link='logit'.
link="logit" does not seem right for multiclass, as it's only suitable for binary output. This is why you do not see probabilities summing up to 1.
Let's streamline your code:
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap
from scipy.special import softmax, logit, expit
np.random.seed(42)
X, y_true = make_blobs(n_samples=500, centers=5, n_features=3, random_state=0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=5)
y_predicted = kmeans.fit_predict(X_scaled)
clf = xgb.XGBClassifier()
clf.fit(X_scaled, y_predicted)
shap.initjs()
Then, what you see as expected values in:
explainer = shap.TreeExplainer(clf)
explainer.expected_value
array([0.67111245, 0.60223354, 0.53357694, 0.50821152, 0.50145331])
are base scores in raw space.
The multi-class raw scores can be converted to probabilities with softmax:
softmax(explainer.expected_value)
array([0.22229282, 0.20749694, 0.19372895, 0.18887673, 0.18760457])
shap.force_plot(..., link="logit") doesn't make sense for multiclass, and it seems impossible to switch from raw to probability and still maintain additivity (because softmax(x+y) ≠ softmax(x) + softmax(y)).
Should you wish to analyze your data in probability space, try KernelExplainer:
from shap import KernelExplainer
masker = shap.maskers.Independent(X_scaled, 100)
ke = KernelExplainer(clf.predict_proba, data=masker.data)
ke.expected_value
# array([0.18976762, 0.1900516 , 0.20042894, 0.19995041, 0.21980143])
shap_values = ke.shap_values(masker.data)
shap.force_plot(ke.expected_value[0], shap_values[0][0])
or, as a waterfall plot:
from shap import Explanation
shap.waterfall_plot(Explanation(shap_values[0][0], ke.expected_value[0]))
These shap values are now additive in probability space, and they align well with both the base probabilities (see above) and the predicted probabilities for the 0th datapoint:
clf.predict_proba(masker.data[0].reshape(1,-1))
array([[2.2844513e-04, 8.1287889e-04, 6.5225776e-04, 9.9737883e-01,
9.2762709e-04]], dtype=float32)
I want to check my loss values using MSE during the training process. How can I fetch the loss value, computed as MSE, at each iteration? Thank you.
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error
dataset = open_dataset("forex.csv")
dataset_vector = [float(i[-1]) for i in dataset]
normalized_dataset_vector = normalize_vector(dataset_vector)
training_vector, validation_vector, testing_vector = split_dataset(training_size, validation_size, testing_size, normalized_dataset_vector)
training_features = get_features(training_vector)
training_fact = get_fact(training_vector)
validation_features = get_features(validation_vector)
validation_fact = get_fact(validation_vector)
model = MLPRegressor(activation=activation, alpha=alpha, hidden_layer_sizes=(neural_net_structure[1],), max_iter=number_of_iteration, random_state=seed)
model.fit(training_features, training_fact)
pred = model.predict(validation_features)
err = mean_absolute_error(validation_fact, pred)
print(err)
There's no callbacks object like there is in Keras, so you'll have to loop over the fitting process yourself to get the error at each iteration. Something like the below will work for you:
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_absolute_error

# create some toy data
X = np.random.random((100, 5))
y = np.random.choice([0, 1], 100)

max_iter = 500
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=max_iter)
errors = []
for i in range(max_iter):
    mlp.partial_fit(X, y, classes=[0, 1])
    pred = mlp.predict(X)
    errors.append(mean_absolute_error(y, pred))
Which throws an annoying DeprecationWarning at the moment, but that can be ignored. The only problem with using this method is that you have to manually keep track of whether or not your model has converged. Personally I would suggest using Keras instead of sklearn if you want to work with neural networks.
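Since the question asks about MSE with an MLPRegressor, the same loop can be adapted to the regression setting (a sketch on toy data; mean_squared_error stands in for the MAE above):
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# toy regression data
X_reg = np.random.random((100, 5))
y_reg = np.random.random(100)

max_iter = 500
reg = MLPRegressor(hidden_layer_sizes=(10, 10, 10), max_iter=max_iter)
mse_per_iter = []
for i in range(max_iter):
    reg.partial_fit(X_reg, y_reg)   # one optimization step per call
    mse_per_iter.append(mean_squared_error(y_reg, reg.predict(X_reg)))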
I am trying to estimate the confusion matrix of a classifier using 10-fold cross-validation with sklearn.
To compute the confusion matrix I am using sklearn.metrics.confusion_matrix. I know that I can evaluate a model with cv using sklearn.model_selection.cross_val_score and sklearn.metrics.make_scorer like:
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score
cm = cross_val_score(clf, X, y, scoring=make_scorer(confusion_matrix))
Where clf is my classifier and X, y the feature and class vectors. However, this will raise an error since confusion_matrix does not return a float number but a matrix.
I've tried doing something like:
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
def cv_confusion_matrix(clf, X, y, folds=10):
    skf = StratifiedKFold(n_splits=folds)
    cv_iter = skf.split(X, y)
    cms = []
    for train, test in cv_iter:
        clf.fit(X[train], y[train])
        cm = confusion_matrix(y[test], clf.predict(X[test]), labels=clf.classes_)
        cms.append(cm)
    return np.mean(np.array(cms), axis=0)  # average the confusion matrices over folds
This will work, but I am missing the parallelism that sklearn provides with cross_val_score and its n_jobs parameter.
Is there any way to do this while taking advantage of that parallelism?
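For what it's worth, one commonly suggested alternative (a sketch; note it pools the out-of-fold predictions into a single confusion matrix rather than averaging per-fold matrices): cross_val_predict runs the folds in parallel via n_jobs and returns one out-of-fold prediction per sample.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(clf, X, y, cv=10, n_jobs=-1)
cm = confusion_matrix(y, y_pred)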