sklearn stratified k-fold CV with a linear model like ElasticNetCV - python

Using cross-validation (CV) with sklearn is quite straightforward. But when you set cv=5 in a linear CV model such as ElasticNetCV or LassoCV, the default splitter is KFold. For various reasons I'd like to use a StratifiedKFold instead. From the documentation, it seems like any CV method can be passed via cv=.
Passing cv=KFold(5) works as expected, but cv=StratifiedKFold(5) raises the error:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
I know that I can use cross_val_score after fitting, but I'd like to pass StratifiedKFold as CV directly to the linear model.
My minimum working example is:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)
# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y) # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y) # also works fine
# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y) # THIS RAISES THE ERROR
Any idea how I can set StratifiedKFold as CV directly?

The root of your problem is this line:
y = np.arange(100) + np.random.rand(100)
StratifiedKFold stratifies by class label, so it cannot split on a continuous target; hence your error. Change this line and your code will execute happily:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.random.choice([0,1], size=100)
# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y) # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y) # also works fine
# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y) # no ERROR
NOTE
If your target is continuous, use KFold. If your target is categorical, you may use either KFold or StratifiedKFold, whichever suits your needs.
NOTE 2
If you insist on emulating stratified sampling on continuous data, you may wish to apply pandas.cut to your target, do the stratified split on the binned values, and finally pass the resulting (train_id, test_id) generator to the cv param:
import pandas as pd
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)
y_cat = pd.cut(y, 10, labels=range(10))
skf_gen = StratifiedKFold(5).split(x, y_cat)
model_skf = ElasticNetCV(cv=skf_gen)
model_skf.fit(x, y) # no ERROR
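One caveat worth noting (not in the original answer): the generator returned by split() is exhausted after a single iteration, so a model built on it cannot be refit. If you need to reuse the splits, materialize them as a list first. A minimal sketch:
# The generator from .split() can be iterated only once; a list of
# (train_idx, test_idx) tuples can be reused across repeated fits.
skf_splits = list(StratifiedKFold(5).split(x, y_cat))
model_skf = ElasticNetCV(cv=skf_splits)
model_skf.fit(x, y)  # works, and the same splits remain available for refits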

Related

Transformed Target Regressor with Predict Parameters

I am using scikit-learn's Gaussian process regression (GPR) for a problem, together with the built-in TransformedTargetRegressor.
The issue I face is that GPR allows predictions to be returned with their standard deviations: in that case the estimator returns a tuple of two numpy arrays (one for the mean, the other for the std). However, TransformedTargetRegressor expects a single numpy array and therefore breaks when the predict method is called with return_std=True.
I have dropped in a really simple example to demonstrate this. It's meant to be representative of an actual problem, hence the inclusion of a pipeline, albeit one with no pre-processing steps. There are also some lines commented out that demonstrate how the predict method works without the TransformedTargetRegressor.
I'd like to hear if there is any way around this, short of applying the transformer to the predictions myself manually.
#%% Imports
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import PowerTransformer
#%% Generate Data
X = np.linspace(start=0, stop=10, num=1_000).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
training_indices = rng.choice(np.arange(y.size), size=6, replace=False)
X_train, y_train = X[training_indices], y[training_indices]
#%% Fit Model
kernel = 1 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2))
# Standard Estimator
# estimator = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
# Transformed Estimator
estimator = TransformedTargetRegressor(
    regressor=GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9),
    transformer=PowerTransformer(method='yeo-johnson')
)
pipe = Pipeline(
    steps=[
        ("estimator", estimator)
    ]
)
pipe.fit(X_train, y_train)
#%% Predict
# No parameters - prediction returns a numpy array
# pipe.predict(X)
# return_std=True - the underlying GPR returns a tuple of two arrays,
# which TransformedTargetRegressor cannot handle
mean_prediction, std_prediction = pipe.predict(X, return_std=True)  # raises an error
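One possible workaround (a sketch, not from the original post): reach into the fitted TransformedTargetRegressor, call the underlying GPR directly, and inverse-transform the mean yourself. Note that the returned std lives in the transformed target space.
# Sketch of a manual workaround using the fitted pieces of the pipeline.
ttr = pipe.named_steps["estimator"]  # fitted TransformedTargetRegressor
mean_t, std_t = ttr.regressor_.predict(X, return_std=True)  # transformed space
mean_prediction = ttr.transformer_.inverse_transform(mean_t.reshape(-1, 1)).ravel()
# Caveat: std_t is the std of the *transformed* target; a nonlinear transform
# like Yeo-Johnson admits no exact closed-form back-transform of the std.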

How do I manually `predict_proba` from logistic regression model in scikit-learn?

I am trying to manually predict a logistic regression model using the coefficient and intercept outputs from a scikit-learn model. However, I can't match up my probability predictions with the predict_proba method from the classifier.
I have tried:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from scipy.special import expit
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0).fit(X, y)
# use sklearn's predict_proba function
sk_probas = clf.predict_proba(X[:1, :])
# and attempting manually (using scipy's inverse logit)
manual_probas = expit(np.dot(X[:1], clf.coef_.T) + clf.intercept_)
# with a completely manual inverse logit
full_manual_probas = 1 / (1 + np.exp(-(np.dot(X[:1], clf.coef_.T) + clf.intercept_)))
outputs:
>>> sk_probas
array([[9.81815067e-01, 1.81849190e-02, 1.44120963e-08]])
>>> manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
>>> full_manual_probas
array([[9.99352591e-01, 9.66205386e-01, 2.26583306e-05]])
I do seem to get the classes to match (using np.argmax), but the probabilities are different. What am I missing?
I've looked at this and this but haven't managed to figure it out yet.
The documentation states that
For a multi_class problem, if multi_class is set to be “multinomial” the softmax function is used to find the predicted probability of each class
That is, in order to get the same values as sklearn you have to normalize using softmax, like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X, y)
decision = np.dot(X[:1], clf.coef_.T)+clf.intercept_
print(clf.predict_proba(X[:1]))
print(np.exp(decision) / np.exp(decision).sum())
To use sigmoids instead you can do it like this:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(random_state=0, max_iter=1000, multi_class='ovr').fit(X, y) # Notice the extra argument
full_manual_probas = 1/(1+np.exp(-(np.dot(X[:1], clf.coef_.T)+clf.intercept_)))
print(clf.predict_proba(X[:1]))
print(full_manual_probas / full_manual_probas.sum())

Why such different answers for the xgboost scikit-learn interface?

I am using xgboost for the first time and trying the two different interfaces. First I get the data:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]
dmatrix = xgb.DMatrix(data=X, label=y)
Now the scikit-learn interface:
xgbr = xgb.XGBRegressor(objective='reg:squarederror', seed=20)
print(cross_val_score(xgbr, X, y, cv=5))
This outputs:
[0.73438184 0.84902986 0.82579692 0.52374618 0.29743001]
Now the xgboost native interface:
dmatrix = xgb.DMatrix(data=X, label=y)
params={'objective':'reg:squarederror'}
cv_results = xgb.cv(dtrain=dmatrix, params=params, nfold=5, metrics={'rmse'}, seed=20)
print('RMSE: %.2f' % cv_results['test-rmse-mean'].min())
This gives 3.50.
Why are the outputs so different? What am I doing wrong?
First of all, you didn't specify a metric in cross_val_score, so you are not computing RMSE but the estimator's default score, which for a regressor is R². You need to specify the scoring explicitly for comparable results:
cross_val_score(xgbr, X, y, cv=5, scoring='neg_root_mean_squared_error')
Second, you need to match sklearn's CV procedure exactly. For that, you can pass the folds argument to XGBoost's cv method:
from sklearn.model_selection import KFold
cv_results = xgb.cv(dtrain=dmatrix, params=params, metrics={'rmse'}, folds = KFold(n_splits=5))
Finally, you need to ensure that XGBoost's cv procedure actually converges. By default it runs only 10 boosting rounds, which is too low to converge on this dataset. This is controlled by the num_boost_round argument; I found that 100 rounds work just fine here:
cv_results = xgb.cv(dtrain=dmatrix, params=params, metrics={'rmse'}, folds=KFold(n_splits=5), num_boost_round=100)
Now you will get matching results.
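Putting the pieces together, a sketch (assuming the X, y, dmatrix, and params defined above) that should produce comparable RMSE from both interfaces:
# Score both interfaces with RMSE on identical KFold splits, 100 rounds each.
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5)
xgbr = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, seed=20)
sk_rmse = -cross_val_score(xgbr, X, y, cv=kf, scoring='neg_root_mean_squared_error')
cv_results = xgb.cv(dtrain=dmatrix, params=params, metrics={'rmse'},
                    folds=kf, num_boost_round=100, seed=20)
print(sk_rmse.mean())                      # sklearn interface
print(cv_results['test-rmse-mean'].min())  # native interface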
On a side note, it's interesting how you say it's your first time using XGBoost, but you actually have a question on XGBoost dating back to 2017.

Difference in values between xgb.train and xgb.XGBRegressor in Python for certain cases

I noticed that there are two possible implementations of XGBoost in Python as discussed here and here
When I tried running the same dataset through the two possible implementations I noticed that the results were different.
Code
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
import xgboost
import pandas as pd
import numpy as np
from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)
Y = df["target"]
X = df.drop('target', axis=1)
#### Code using Native Impl for XGBoost
dtrain = xgboost.DMatrix(X, label=Y, missing=0.0)
params = {'max_depth': 3, 'learning_rate': .05, 'min_child_weight' : 4, 'subsample' : 0.8}
evallist = [(dtrain, 'eval'), (dtrain, 'train')]
model = xgboost.train(dtrain=dtrain, params=params, num_boost_round=200)
predictions = model.predict(dtrain)
#### Code using Sklearn Wrapper for XGBoost
model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05, min_child_weight=4, subsample=0.8)
#model = model.fit(X, Y, eval_set = [(X, Y), (X, Y)], eval_metric = 'rmse', verbose=True)
model = model.fit(X, Y)
predictions2 = model.predict(X)
print(np.absolute(predictions-predictions2).sum())
Absolute difference sum using sklearn boston dataset
62.687134
When I ran the same for other datasets like the sklearn diabetes dataset I observed that the difference was much smaller.
Absolute difference sum using sklearn diabetes dataset
0.0011711121
Make sure the random seeds are the same. For both approaches, set the same seed:
param['seed'] = 123
EDIT: beyond the seed, there are a couple of other things to check.
Is n_estimators also 200? Are you imputing missing values with 0 in the sklearn version as well? Are the other defaults also the same? (For that last one I think yes, because it's a wrapper, but check the first two.)
I had not set the "missing" parameter for the sklearn implementation. Once that was set, the values matched.
Also, as Noah pointed out, the sklearn wrapper has a few different default values which need to be matched in order to reproduce the results exactly.
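For reference, a sketch of the wrapper call with the missing value and seed aligned with the native DMatrix(missing=0.0) call above (parameter names as in xgboost's sklearn wrapper):
# Align the wrapper's missing-value handling and seed with the native call.
model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05,
                     min_child_weight=4, subsample=0.8,
                     missing=0.0, random_state=123)
model = model.fit(X, Y)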

When do items in the Pipeline call fit_transform(), and when do they call transform()? (scikit-learn, Pipeline)

I'm trying to fit a model that I've put together using Pipeline:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
cross_validation_object = StratifiedKFold(n_splits=10)
scaler = MinMaxScaler(feature_range=(0, 1))
logistic_fit = LogisticRegression(solver='liblinear')  # liblinear supports both l1 and l2
pipeline_object = Pipeline([('scaler', scaler), ('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01, 0.1, 1, 10],
                     'model__penalty': ['l1', 'l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters,
                                  cv=cross_validation_object, scoring='accuracy')
grid_search_object.fit(X_train, Y_train)
My question: Is the best_estimator going to scale the test data based on the values in the training data? For example, if I call:
grid_search_object.best_estimator_.predict(X_test)
It will NOT try to fit the scaler on the X_test data, right? It will just transform it using the original parameters.
Thanks!
The predict methods never fit any data. During fit, each intermediate step's fit_transform is called on the training data; during predict, only transform is called. So, exactly as you describe it, the best_estimator_ pipeline will scale the test data using the parameters it learnt on the training set.
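A short sketch of that equivalence, using the names from the question (X_test assumed to exist alongside X_train):
# The fitted pipeline applies transform (never fit_transform) at predict time.
best = grid_search_object.best_estimator_
fitted_scaler = best.named_steps['scaler']  # MinMaxScaler fitted on X_train only
manual = best.named_steps['model'].predict(fitted_scaler.transform(X_test))
auto = best.predict(X_test)
assert (manual == auto).all()  # identical predictions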
