Transformed Target Regressor with Predict Parameters - python

I am using scikit-learn and Gaussian Process Regression (GPR) for a problem, together with the built-in TransformedTargetRegressor.
The issue I face is that GPR allows predictions to be returned with their standard deviations; in that case the estimator actually returns a tuple of two numpy arrays (one for the mean, the other for the standard deviation). However, TransformedTargetRegressor expects a single numpy array and therefore breaks when the predict method is called with return_std=True.
I have dropped in a really simple example to demonstrate this. It is meant to be representative of the actual problem, hence the inclusion of a pipeline, even though it has no pre-processing steps. There are also some commented-out lines that show how the predict method works without the TransformedTargetRegressor.
I would like to hear if there is any way around this, short of applying the transformer to the predictions manually myself.
#%% Imports
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, DotProduct, RationalQuadratic, Matern)
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import PowerTransformer
#%% Generate Data
X = np.linspace(start=0, stop=10, num=1_000).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
training_indices = rng.choice(np.arange(y.size), size=6, replace=False)
X_train, y_train = X[training_indices], y[training_indices]
#%% Fit Model
kernel = 1 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2))
# Standard Estimator
# estimator = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
# Transformed Estimator
estimator = TransformedTargetRegressor(
    regressor=GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9),
    transformer=PowerTransformer(method='yeo-johnson')
)
pipe = Pipeline(
    steps=[
        ("estimator", estimator)
    ]
)
pipe.fit(X_train, y_train)
#%% Predict
# No parameters - Prediction returns numpy array
# pipe.predict(X)
# Std Parameter - Prediction returns tuple of numpy arrays
mean_prediction, std_prediction = pipe.predict(X, return_std=True)
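One workaround worth noting (a sketch based on the fitted attributes TransformedTargetRegressor exposes, not a built-in feature): after fitting, pull the fitted regressor_ and transformer_ out of the TransformedTargetRegressor and apply the inverse transform yourself. Only the mean can be mapped back this way; the standard deviation lives in the transformed (Yeo-Johnson) space and does not translate directly back to the original units.

# Sketch of a manual workaround, assuming pipe and X from the code above
# have already been fitted/defined. Since the pipeline has no X
# pre-processing steps, X can be passed to the inner regressor directly.
ttr = pipe.named_steps["estimator"]   # fitted TransformedTargetRegressor
gpr = ttr.regressor_                  # fitted GaussianProcessRegressor
pt = ttr.transformer_                 # fitted PowerTransformer

# Predictions in the transformed target space
mean_t, std_t = gpr.predict(X, return_std=True)

# Map the mean back to the original target space
mean_prediction = pt.inverse_transform(mean_t.reshape(-1, 1)).ravel()
# std_t remains a standard deviation in the transformed space only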

Related

How do I manually `predict_proba` from logistic regression model in scikit-learn?

I am trying to manually predict a logistic regression model using the coefficient and intercept outputs from a scikit-learn model. However, I can't match up my probability predictions with the predict_proba method from the classifier.
I have tried:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from scipy.special import expit
import numpy as np

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)

# use sklearn's predict_proba function
sk_probas = clf.predict_proba(X[:1, :])

# and attempting manually (using scipy's inverse logit)
manual_probas = expit(np.dot(X[:1], clf.coef_.T) + clf.intercept_)

# with a completely manual inverse logit
full_manual_probas = 1 / (1 + np.exp(-(np.dot(X[:1], clf.coef_.T) + clf.intercept_)))
outputs:
>>> sk_probas
array([[9.81815067e-01, 1.81849190e-02, 1.44120963e-08]])
>>> manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
>>> full_manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
I do seem to get the classes to match (using np.argmax), but the probabilities are different. What am I missing?
I've looked at this and this but haven't managed to figure it out yet.
The documentation states that
For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class
That is, in order to get the same values as sklearn you have to normalize using softmax, like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X, y)
decision = np.dot(X[:1], clf.coef_.T)+clf.intercept_
print(clf.predict_proba(X[:1]))
print(np.exp(decision) / np.exp(decision).sum())
To use sigmoids instead you can do it like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000, multi_class='ovr').fit(X, y) # Notice the extra argument
full_manual_probas = 1/(1+np.exp(-(np.dot(X[:1], clf.coef_.T)+clf.intercept_)))
print(clf.predict_proba(X[:1]))
print(full_manual_probas / full_manual_probas.sum())

Why does the LightGBM Python package give bad predictions for a regression task?

I have a sample time-series dataset of shape (23, 208), which is a pivot table of 24-hour counts for some users. I was experimenting with different regressors from sklearn, which work fine (except for SGDRegressor()), but the LightGBM Python package gives me a very linear prediction, as follows:
My code:
import pandas as pd
dff = pd.read_csv('ex_data2.csv', sep=',')
dff.set_index("timestamp", inplace=True)
print(dff.shape)

from sklearn.model_selection import train_test_split
trainingSetf, testSetf = train_test_split(dff,
                                           #target_attribute,
                                           test_size=0.2,
                                           random_state=42,
                                           #stratify=y,
                                           shuffle=False)

import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor

username = 'MMC_HEC_LVP'  # select one column for plotting & check regression performance
user_list = []
for column in dff.columns:
    user_list.append(column)
index = user_list.index(username)

X_trainf = trainingSetf.iloc[:, :].values
y_trainf = trainingSetf.iloc[:, :].values
X_testf = testSetf.iloc[:, :].values
y_testf = testSetf.iloc[:, :].values
test_set_copy = y_testf.copy()

model_LGBMRegressor = MultiOutputRegressor(lgb.LGBMRegressor()).fit(X_trainf, y_trainf)
pred_LGBMRegressor = model_LGBMRegressor.predict(X_testf)
test_set_copy[:, [index]] = pred_LGBMRegressor[:, [index]]

# plot the results for the selected user/column
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.figure(figsize=(12, 10))
plt.xlabel("Date")
plt.ylabel("Values")
plt.title(f"{username} Plot")
plt.plot(trainingSetf.iloc[:, [index]], label='trainingSet')
plt.plot(testSetf.iloc[:, [index]], "--", label='testSet')
plt.plot(test_set_copy[:, [index]], 'b--', label='RF_predict')
plt.legend()
So what am I missing if I use the default (hyper-)parameters?
Short Answer
Your dataset has a very small number of rows, and LightGBM's parameters have default values set to provide good performance on medium-sized datasets.
Set the following parameters to force LightGBM to fit to the provided data.
min_data_in_bin = 1
min_data_in_leaf = 1
Long Answer
Before training, LightGBM does some pre-processing on the input data.
For example:
bundling sparse features
binning continuous features into histograms
dropping features which are guaranteed to be uninformative (for example, features which are constant)
The result of that preprocessing is a LightGBM Dataset object, and running that preprocessing is called Dataset "construction". LightGBM performs boosting on this Dataset object, not raw data like numpy arrays or pandas data frames.
To speed up construction and prevent overfitting during training, LightGBM provides the ability to prevent the creation of histogram bins that are too small (min_data_in_bin) or splits that produce leaf nodes which match too few records (min_data_in_leaf).
Setting those parameters to very low values may be required to train on small datasets.
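As an aside, the same constraints can also be relaxed through the native lightgbm API, where the Dataset object is constructed explicitly. The following is only an illustrative sketch with tiny random data (min_data_in_bin is passed as a Dataset construction parameter, min_data_in_leaf as a training parameter):

import lightgbm as lgb
import numpy as np

# illustrative sketch only: tiny random data standing in for a real problem
rng = np.random.default_rng(0)
X = rng.random((20, 5))
y = rng.random(20)

# min_data_in_bin is a Dataset (construction) parameter ...
train_set = lgb.Dataset(X, label=y, params={"min_data_in_bin": 1})

# ... while min_data_in_leaf is a training parameter
booster = lgb.train(
    params={"objective": "regression", "min_data_in_leaf": 1, "verbosity": -1},
    train_set=train_set,
    num_boost_round=20,
)
print(booster.predict(X)[:5])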
I created the following minimal, reproducible example, using Python 3.8.12, lightgbm==3.3.2, numpy==1.22.2, and scikit-learn==1.0.2 demonstrating this behavior.
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression

# 20-row input data
X, y = make_regression(
    n_samples=20,
    n_informative=5,
    n_features=5,
    random_state=708
)

# training produces 0 trees, and predicts mean(y)
reg = LGBMRegressor(
    num_boost_round=20,
    verbosity=0
)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.000

# training fits and predicts well
reg = LGBMRegressor(
    min_data_in_bin=1,
    min_data_in_leaf=1,
    num_boost_round=20,
    verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.985
If you use LGBMRegressor(min_data_in_bin=1, min_data_in_leaf=1) in the code in the original post, you'll see predictions that fit the provided data better.
Note, however, that this way the model is overfitted!
If you do a random split after creating the dataset and evaluate the model on a held-out test set, you will notice that the performance is essentially the same or worse (as in this example).
# SETUP
# =============================================================
from lightgbm import LGBMRegressor
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(
    n_samples=200, n_informative=10, n_features=40, random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
# =============================================================
# TEST 1
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X, y)
print(f"r2 (defaults): {r2_score(y, reg.predict(X))}")
# 0.815

reg = LGBMRegressor(
    min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X, y)
print(f"r2 (small min_data): {r2_score(y, reg.predict(X))}")
# 0.974
# =============================================================
# TEST 2
reg = LGBMRegressor(num_boost_round=20, verbosity=0)
reg.fit(X_train, y_train)
print(f"r2 (defaults): {r2_score(y_train, reg.predict(X_train))}")
# 0.759

reg = LGBMRegressor(
    min_data_in_bin=1, min_data_in_leaf=1, num_boost_round=20, verbosity=0
)
reg.fit(X_train, y_train)
print(f"r2 (small min_data): {r2_score(y_test, reg.predict(X_test))}")
# 0.219

Use GaussianProcessRegressor with precomputed kernel in scikit-learn

I am trying to use the GaussianProcessRegressor in scikit-learn with some graph kernels computed by the grakel software. Below is my code for a 5-fold cross-validation on 100 graphs. For testing convenience, I have commented out all graph-related lines and use random kernel matrices and y values instead.
from sklearn.model_selection import KFold
from sklearn.utils import check_random_state
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.metrics import mean_squared_error
#from grakel.kernels import WeisfeilerLehman
import numpy as np
def Kfold_CV_GPR(Gs, y, n_iter=4, n_splits=5, random_state=None):
    random_state = check_random_state(random_state)
    kf = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    errors = []
    for train_idxs, test_idxs in kf.split(y):
        # gk = WeisfeilerLehman(n_iter=n_iter, normalize=True)
        # K_train = gk.fit_transform(Gs[train_idxs])
        # K_test = gk.transform(Gs[test_idxs])
        K_train = np.random.randn(80, 80)
        K_test = np.random.randn(20, 80)
        gpr = GPR(kernel='precomputed')
        gpr.fit(K_train, y[train_idxs])
        y_pred = gpr.predict(K_test)
        rmse = mean_squared_error(y[test_idxs], y_pred, squared=False)
        errors.append(rmse)
    return -np.mean(errors)

score = Kfold_CV_GPR(Gs=None, y=np.random.randn(100, ), n_iter=4, n_splits=5)
print(score)
However, I am getting the following error
TypeError: Cannot clone object ''precomputed'' (type <class 'str'>): it does not seem to be a scikit-learn
estimator as it does not implement a 'get_params' method.
When I change sklearn.gaussian_process.GaussianProcessRegressor to sklearn.svm.SVR (support vector regression), my code doesn't throw any error, but it runs forever for some reason. I also tested classifiers like sklearn.svm.SVC, and my code works fine there.
Does anyone know how to use a precomputed kernel with scikit-learn's GaussianProcessRegressor?
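One avenue that may be worth trying (an unverified sketch; GaussianProcessRegressor has no documented kernel='precomputed' support) is PairwiseKernel(metric="precomputed"), which simply passes the supplied Gram matrices through. The training kernel must then be a valid symmetric positive semi-definite Gram matrix, unlike the random matrices in the test code above, the test matrix must have shape (n_test, n_train), and return_std / return_cov would need extra care:

from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.gaussian_process.kernels import PairwiseKernel
import numpy as np

# Unverified sketch: stand-ins for grakel output, a symmetric PSD train
# kernel (80 x 80) and the corresponding test-vs-train kernel (20 x 80).
rng = np.random.RandomState(0)
A = rng.randn(80, 100)
B = rng.randn(20, 100)
K_train = A @ A.T
K_test = B @ A.T
y_train = rng.randn(80)

# metric="precomputed" returns the supplied matrix as-is; gamma is kept
# fixed so there are no kernel hyperparameters left to optimize.
kernel = PairwiseKernel(metric="precomputed", gamma_bounds="fixed")
gpr = GPR(kernel=kernel, alpha=1e-6)
gpr.fit(K_train, y_train)

y_pred = gpr.predict(K_test)  # mean prediction only
print(y_pred.shape)           # (20,)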

sklearn stratified k-fold CV with linear model like ElasticNetCV

Using cross-validation (CV) with sklearn is quite easy and straightforward. But the default implementation when setting cv=5 in a linear CV model, like ElasticNetCV or LassoCV, is a KFold CV. For various reasons I'd like to use a StratifiedKFold instead. From the documentation, it seems like any CV method can be passed via cv=.
Passing cv=KFold(5) works as expected, but cv=StratifiedKFold(5) raises the error:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
I know that I can use cross_val_score after fitting, but I'd like to pass StratifiedKFold as CV directly to the linear model.
My minimum working example is:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)
# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y) # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y) # also works fine
# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y) # THIS RAISES THE ERROR
Any idea how I can set StratifiedKFold as CV directly?
The root of your problem is this line:
y = np.arange(100) + np.random.rand(100)
StratifiedKFold cannot sample from a continuous distribution, hence your error. Try changing this line and your code will execute happily:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.random.choice([0,1], size=100)
# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y) # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y) # also works fine
# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y) # no ERROR
NOTE
If you are sampling from continuous data, use KFold. If your target is categorical, you may use either KFold or StratifiedKFold, whichever suits your needs.
NOTE 2
If you insist on emulating stratified sampling on continuous data, you may wish to apply pandas.cut to your data, then do stratified sampling on the binned data, and finally pass the resulting (train_id, test_id) generator to the cv param:
import pandas as pd

x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)

# bin the continuous target into 10 categories and stratify on those bins
y_cat = pd.cut(y, 10, labels=range(10))
skf_gen = StratifiedKFold(5).split(x, y_cat)

model_skf = ElasticNetCV(cv=skf_gen)
model_skf.fit(x, y)  # no ERROR

When do items in the Pipeline call fit_transform(), and when do they call transform()? (scikit-learn, Pipeline)

I'm trying to fit a model that I've put together using Pipeline:
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

cross_validation_object = cross_validation.StratifiedKFold(Y, n_folds=10)
scaler = MinMaxScaler(feature_range=[0, 1])
logistic_fit = LogisticRegression()

pipeline_object = Pipeline([('scaler', scaler), ('model', logistic_fit)])

tuned_parameters = [{'model__C': [0.01, 0.1, 1, 10],
                     'model__penalty': ['l1', 'l2']}]

grid_search_object = GridSearchCV(pipeline_object, tuned_parameters,
                                  cv=cross_validation_object, scoring='accuracy')
grid_search_object.fit(X_train, Y_train)
My question: Is the best_estimator going to scale the test data based on the values in the training data? For example, if I call:
grid_search_object.best_estimator_.predict(X_test)
It will NOT try to fit the scaler on the X_test data, right? It will just transform it using the original parameters.
Thanks!
The predict methods never fit any data. When you call fit on a Pipeline, each intermediate step calls fit_transform on the training data; when you call predict (or score), each intermediate step only calls transform. So in this case, exactly as you describe it, the best_estimator_ pipeline is going to scale the test data based on the scaling it has learnt on the training set.
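A small self-contained sketch illustrating this (using the current sklearn imports rather than the older cross_validation/grid_search modules above): the scaler's learnt min/max come from the training data only, and predict merely re-applies them to new data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data: train on values in [0, 10], then predict on values outside that range.
X_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = (X_train.ravel() > 5).astype(int)
X_test = np.array([[-5.0], [20.0]])

pipe = Pipeline([('scaler', MinMaxScaler(feature_range=[0, 1])),
                 ('model', LogisticRegression())])
pipe.fit(X_train, y_train)  # scaler.fit_transform + model.fit

print(pipe.named_steps['scaler'].data_min_, pipe.named_steps['scaler'].data_max_)
# [0.] [10.]  -> learnt from the training data only

pipe.predict(X_test)        # scaler.transform only; the scaler is NOT refitted
print(pipe.named_steps['scaler'].transform(X_test))
# [[-0.5]
#  [ 2. ]]  -> test values scaled with the training min/max, even outside [0, 1]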
