TransformedTargetRegressor does not inherit feature_importances_ attribute - python

I am using TransformedTargetRegressor to transform my target into log space. It is set up like this:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.compose import TransformedTargetRegressor
clf = TransformedTargetRegressor(regressor=GradientBoostingRegressor(**params),
                                 func=np.log1p, inverse_func=np.expm1)
However, when I later call
feature_importance = clf.feature_importances_
I get
AttributeError: 'TransformedTargetRegressor' object has no attribute 'feature_importances_'
I would have thought that all the attributes of the original class would be inherited. How can this be solved?
For further context, here is an official example; replacing the initialization line with mine reproduces the crash.

As the TransformedTargetRegressor documentation says, you can access the underlying regressor through the .regressor_ attribute once the model has been fitted.
So this is what you want:
clf.regressor_.feature_importances_
Working code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.compose import TransformedTargetRegressor  # requires sklearn >= 0.20
# #############################################################################
# Load data
boston = datasets.load_boston()  # note: load_boston was removed in scikit-learn 1.2
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
# #############################################################################
# Fit regression model
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}
#clf = ensemble.GradientBoostingRegressor(**params)
clf = TransformedTargetRegressor(regressor=GradientBoostingRegressor(**params),
                                 func=np.log1p, inverse_func=np.expm1)
clf.fit(X_train, y_train)
mse = mean_squared_error(y_test, clf.predict(X_test))
print("MSE: %.4f" % mse)
print(clf.regressor_.feature_importances_)
Its output:
MSE: 7.7145
[6.45223704e-02 1.32970011e-04 2.92221184e-03 4.48101769e-04
3.57392613e-02 2.02435922e-01 1.22755948e-02 7.03996426e-02
1.54903176e-03 1.90771421e-02 1.98577625e-02 1.63376111e-02
5.54302378e-01]

Related

Hyperparameter tuning for StackingRegressor sklearn

In my problem, I would like to tune sklearn.ensemble.StackingRegressor using a simple RandomizedSearchCV tuner. Since the estimators need to be defined when instantiating StackingRegressor(), I couldn't properly define a parameter space for them in my param_distributions.
I tried the following and got an error:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.ensemble import StackingRegressor
X, y = load_diabetes(return_X_y=True)
rfr = RandomForestRegressor()
gbr = GradientBoostingRegressor()
estimators = [rfr, gbr]
sreg = StackingRegressor(estimators=estimators)
params = {'rfr__max_depth': [3, 5, 10, 100],
          'gbr__max_depth': [3, 5, 10, 100]}
grid = RandomizedSearchCV(estimator=sreg,
                          param_distributions=params,
                          cv=3)
grid.fit(X,y)
and got the error AttributeError: 'RandomForestRegressor' object has no attribute 'estimators_'.
Is there any way to tune the parameters of the different estimators within the StackingRegressor?
If you define your estimators as a list of (name, estimator) tuples, as shown below, your code should work.
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.ensemble import StackingRegressor
X, y = load_diabetes(return_X_y=True)
rfr = RandomForestRegressor()
gbr = GradientBoostingRegressor()
estimators = [('rfr', rfr), ('gbr', gbr)]
sreg = StackingRegressor(estimators=estimators)
params = {
    'rfr__max_depth': [3, 5],
    'gbr__max_depth': [3, 5]
}
grid = RandomizedSearchCV(
    estimator=sreg,
    param_distributions=params,
    n_iter=2,
    cv=3,
    verbose=1,
    random_state=100
)
grid.fit(X, y)
res = pd.DataFrame(grid.cv_results_)
print(res)
# mean_fit_time std_fit_time ... std_test_score rank_test_score
# 0 1.121728 0.024188 ... 0.024546 2
# 1 1.096936 0.034377 ... 0.013047 1
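As an aside, the final meta-estimator can be tuned the same way through the final_estimator__ prefix, provided you pass an explicit final_estimator (by default StackingRegressor fits a RidgeCV). A minimal sketch, reusing estimators, X and y from the block above:
from sklearn.linear_model import Ridge
# pass an explicit final_estimator so its parameters show up in get_params()
sreg = StackingRegressor(estimators=estimators, final_estimator=Ridge())
params = {
    'rfr__max_depth': [3, 5],
    'gbr__max_depth': [3, 5],
    'final_estimator__alpha': [0.1, 1.0, 10.0]
}
grid = RandomizedSearchCV(estimator=sreg, param_distributions=params,
                          n_iter=4, cv=3, random_state=100)
grid.fit(X, y)
print(grid.best_params_)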

Calculating AUC for LogisticRegression model

Let's take data
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
data = load_breast_cancer()
X = data.data
y = data.target
I want to build a model using only the first principal component and calculate the AUC for it.
My work so far:
from sklearn.linear_model import LogisticRegression
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
pca = PCA(n_components=1)
principalComponents = pca.fit_transform(X_scaled)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1'])
clf = LogisticRegression()
clf = clf.fit(principalDf, y)
pred = clf.predict_proba(principalDf)
But when I try to use
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
the following error occurs:
y should be a 1d array, got an array of shape (569, 2) instead.
I tried to reshape my data
fpr, tpr, thresholds = metrics.roc_curve(y.reshape(1,-1), pred, pos_label=2)
But it didn't solve the issue; instead it outputs:
multilabel-indicator format is not supported
Do you have any idea how I can compute the AUC for this first principal component?
You may wish to try:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
X,y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y)
scaler = StandardScaler()
pca = PCA(2)
clf = LogisticRegression()
ppl = Pipeline([("scaler",scaler),("pca",pca),("clf",clf)])
ppl.fit(X_train, y_train)
preds = ppl.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, preds, pos_label=1)
metrics.plot_roc_curve(ppl, X_test, y_test)
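As a side note, roc_curve is usually fed continuous scores rather than hard class predictions, otherwise the curve only has a single non-trivial threshold. A small sketch, reusing the ppl pipeline fitted above:
# class-1 probabilities from the fitted pipeline
probs = ppl.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, probs, pos_label=1)
print(metrics.auc(fpr, tpr))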
The problem is that predict_proba returns one column per class. Generally with binary classification your classes are 0 and 1, so you want the probability of the second class; it's quite common to slice as follows (replacing the last line in your code block):
pred = clf.predict_proba(principalDf)[:, 1]
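With that one-dimensional score vector you can then compute the curve and the AUC directly; a short sketch continuing from the question's variables (clf, principalDf, y, metrics). Note that the load_breast_cancer targets are 0/1, so the default pos_label=1 is what you want, not pos_label=2:
pred = clf.predict_proba(principalDf)[:, 1]   # probability of class 1
fpr, tpr, thresholds = metrics.roc_curve(y, pred)
print(metrics.roc_auc_score(y, pred))         # area under the ROC curve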

sklearn.exceptions.NotFittedError: Estimator not fitted, call `fit` before exploiting the model

I tried Random Forests regression.
The code is given below.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
np.random.seed(0)
d1 = np.random.randint(2, size=(50, 10))
d2 = np.random.randint(3, size=(50, 10))
d3 = np.random.randint(4, size=(50, 10))
Y = np.random.randint(7, size=(50,))
X = np.column_stack([d1, d2, d3])
n_smples, n_feats = X.shape
print (n_smples, n_feats)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
regr = RandomForestRegressor(max_features=None,random_state=0)
pipe = make_pipeline(RFECV(estimator=regr, step=3, cv=kf,
                           scoring='neg_mean_squared_error', n_jobs=-1),
                     GridSearchCV(regr, param_grid={'n_estimators': [100, 300]},
                                  cv=kf, scoring='neg_mean_squared_error',
                                  n_jobs=-1))
ypredicts = cross_val_predict(pipe, X, Y, cv=kf, n_jobs=-1)
rmse = mean_squared_error(Y, ypredicts)
print (rmse)
However, I got the following error:
sklearn.exceptions.NotFittedError: Estimator not fitted, call fit before exploiting the model.
I also tried:
model = pipe.fit(X,Y)
ypredicts = cross_val_predict(model, X, Y, cv=kf, n_jobs=-1)
But got the same error.
Edit 1:
I also tried:
pipe.fit(X,Y)
But got the same error.
In Python 2.7 (Sklearn 0.20), the same code gave me a different error:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
In Python 2.7 (Sklearn 0.20.3):
NotFittedError: Estimator not fitted, call `fit` before exploiting the model.
It seems that you are trying to select the best parameters for your model with a grid search; there is another way to do this. You are using pipelines, but in the method below I skip the pipeline and get the best parameters through a randomized search instead.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
np.random.seed(0)
d1 = np.random.randint(2, size=(50, 10))
d2 = np.random.randint(3, size=(50, 10))
d3 = np.random.randint(4, size=(50, 10))
Y = np.random.randint(7, size=(50,))
X = np.column_stack([d1, d2, d3])
n_smples, n_feats = X.shape
print (n_smples, n_feats)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
regr = RandomForestRegressor(max_features=None,random_state=0)
n_iter_search = 20
random_search = RandomizedSearchCV(regr, param_distributions={'n_estimators': [100, 300]},
                                   n_iter=20, cv=kf, verbose=1, return_train_score=True)
random_search.fit(X, Y)
ypredicts=random_search.predict(X)
rmse = mean_squared_error(Y, ypredicts)
print(rmse)
print(random_search.best_params_)
random_search.cv_results_
Try this piece of code; I hope it solves your problem.
Instead of
model = pipe.fit(X,Y)
have you tried
pipe.fit(X,Y)
so the full snippet would be:
pipe.fit(X,Y)
# change model to pipe
ypredicts = cross_val_predict(pipe, X, Y, cv=kf, n_jobs=-1)
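Another common pattern, sketched here as an alternative rather than a drop-in fix, is to invert the nesting: build the pipeline first, wrap it in GridSearchCV (addressing steps with the double-underscore prefixes that make_pipeline derives from the lower-cased class names), and pass that search object to cross_val_predict:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
np.random.seed(0)
X = np.column_stack([np.random.randint(2, size=(50, 10)),
                     np.random.randint(3, size=(50, 10)),
                     np.random.randint(4, size=(50, 10))])
Y = np.random.randint(7, size=(50,))
kf = KFold(n_splits=5, shuffle=True, random_state=0)
regr = RandomForestRegressor(max_features=None, random_state=0)
# pipeline = feature elimination followed by the final regressor
pipe = make_pipeline(RFECV(estimator=regr, step=3, scoring='neg_mean_squared_error'),
                     RandomForestRegressor(random_state=0))
# the whole pipeline (not a single estimator) goes inside the search
search = GridSearchCV(pipe,
                      param_grid={'randomforestregressor__n_estimators': [100, 300]},
                      cv=kf, scoring='neg_mean_squared_error')
ypredicts = cross_val_predict(search, X, Y, cv=kf)
print(mean_squared_error(Y, ypredicts))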

How to predict data in python after training a stacked model?

I'm new to machine learning in Python and have seen the concept of stacking models, so I wanted to give it a shot. The problem is I don't know how to predict on new data, as I don't fully understand how this is implemented in Python. The code that I managed to scrape together looks like this:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error,mean_squared_error
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import GradientBoostingRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from vecstack import stacking
import pandas as pd
X = pd.read_csv('db/file_name3.csv')
y = pd.read_csv('db/train_labels(1).csv')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
models = [
    CatBoostRegressor(iterations=200,
                      learning_rate=0.03,
                      depth=4,
                      loss_function='RMSE',
                      eval_metric='RMSE',
                      random_seed=99,
                      od_type='Iter',
                      od_wait=50,
                      logging_level='Silent'),
    CatBoostRegressor(iterations=500,
                      learning_rate=0.06,
                      depth=3,
                      loss_function='RMSE',
                      eval_metric='RMSE',
                      random_seed=99,
                      od_type='Iter',
                      od_wait=50,
                      logging_level='Silent'),
    ExtraTreesRegressor(random_state=0, n_jobs=-1,
                        n_estimators=100, max_depth=3),
    RandomForestRegressor(random_state=0, n_jobs=-1,
                          n_estimators=300, max_depth=3),
    XGBRegressor(eta=0.02, reg_lambda=5, reg_alpha=1),
    XGBRegressor(eta=0.1, reg_lambda=1, reg_alpha=10),
    XGBRegressor(eta=0.02, reg_lambda=1, reg_alpha=10, n_estimators=300),
    XGBRegressor(eta=0.012, max_depth=3, n_estimators=200),
    GradientBoostingRegressor(),
    BaggingRegressor(),
]
test1= pd.read_csv('db/Cleaned Data.csv')
S_train, S_test = stacking(models, X_train, y_train, X_train,
                           regression=True, metric=mean_absolute_error, n_folds=10,
                           shuffle=True, random_state=0, verbose=2)
model = model.fit(S_train, y_train)
y_pred = model.predict(S_test)
print(y_pred.shape)
As you can see, test1 is the data that I want to predict on, but I couldn't figure out how. I can get predictions for my training set but not for the new data. I have not changed any of the models' parameters from the documentation.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
# load your X, y and test1 data here
RF = RandomForestRegressor(random_state=0, n_jobs=-1,
                           n_estimators=300, max_depth=3)
# Validation function
n_folds = 5
def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True,
               random_state=42).get_n_splits(X.values)
    rmse = np.sqrt(-cross_val_score(model, X.values, y.values.ravel(),
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse
score = rmsle_cv(RF)
print("Random Forest score: {:.3f} ({:.3f})\n".format(score.mean(),
                                                      score.std()))
RF.fit(X,y.values.ravel())
RF_train_pred = RF.predict(X)
RF_pred = RF.predict(test1)
print (RF_pred.shape)
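If you would rather keep the vecstack approach from the question, the usual workflow (a sketch assuming test1 has the same columns as X, and reusing models, X_train, y_train and the imports from above) is to pass the new data as the test set of stacking(), fit a second-level model on S_train, and then predict on S_test:
# out-of-fold meta-features for the training set, and meta-features for test1
S_train, S_test = stacking(models, X_train, y_train, test1,
                           regression=True, metric=mean_absolute_error, n_folds=10,
                           shuffle=True, random_state=0, verbose=2)
# any second-level (meta) model can be used here; XGBRegressor is just one choice
meta = XGBRegressor(eta=0.02, max_depth=3, n_estimators=200)
meta.fit(S_train, y_train)
test1_pred = meta.predict(S_test)   # predictions for the new data
print(test1_pred.shape)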

Scikit-Learn: Adjust train_size or test_size?

This is a question regarding best practices for sklearn.
I was experimenting with SVMs using the iris dataset provided in the sklearn library. While using train_test_split, I was wondering which parameter to adjust to avoid overfitting. I was taught to adjust test_size (roughly to ~0.3), but there is also a train_size parameter. Would it not make sense to adjust train_size to avoid overfitting, or am I misunderstanding something here?
I get similar results regardless of which parameter I adjust, but I don't know if that's always the case.
Appreciate any help. Thanks!
Here is the code I am currently working with:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
scaler = StandardScaler()
scaler.fit(df)
scaled_df = scaler.transform(df)
df = pd.DataFrame(data=scaled_df, columns=iris.feature_names)
x = df
y = iris.target
#test_size is used here, but is swapped with train_size to experiment
x_train, x_test, y_train, y_test = tts(x, y, test_size=0.33)
c_param = np.arange(1, 100, 10)
gamma_param = np.arange(0.0001, 1, 0.001)
params = {'C':c_param, 'gamma':gamma_param}
grid = GridSearchCV(estimator=SVC(), param_grid=params, verbose=0)
grid_fit = grid.fit(x_train, y_train)
grid_pred = grid.predict(x_test)
print(grid.best_params_)
print('\n')
print("Number of training records: ", len(x_train))
print("Number of test records: ", len(x_test))
print('\n')
print(classification_report(y_true=y_test, y_pred=grid_pred))
print('\n')
print(confusion_matrix(y_true=y_test, y_pred=grid_pred))
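On the train_size/test_size point itself: in train_test_split the two parameters are complements, so specifying only one sets the other to the remainder, and test_size=0.33 is equivalent (up to rounding) to train_size=0.67. A tiny sketch illustrating this on iris:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
x, y = load_iris(return_X_y=True)
a_train, a_test, _, _ = train_test_split(x, y, test_size=0.33, random_state=0)
b_train, b_test, _, _ = train_test_split(x, y, train_size=0.67, random_state=0)
print(len(a_train), len(a_test))   # 100 50
print(len(b_train), len(b_test))   # 100 50 -- same split sizes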
