Trying to do SVR for multiple outputs - python

Since SVR supports only a single output, I am trying to employ SVR via MultiOutputRegressor on my model, which has 6 inputs and 19 outputs.
I am starting with hyper-parameter tuning, but I am getting the error below. How can I modify my code to support MultiOutputRegressor?
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RandomizedSearchCV, KFold

svr = SVR()
svr_regr = MultiOutputRegressor(svr)

kfold_splitter = KFold(n_splits=6, random_state=0, shuffle=True)

# On each iteration, the algorithm will choose a different combination of the parameters.
svr_random = RandomizedSearchCV(svr_regr,
                                param_distributions={'kernel': ('linear', 'poly', 'rbf', 'sigmoid'),
                                                     'C': [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10],
                                                     'degree': [3, 8],
                                                     'coef0': [0.01, 0.1, 0.5],
                                                     'gamma': ('auto', 'scale'),
                                                     'tol': [1e-3, 1e-4, 1e-5, 1e-6]},
                                n_iter=100,
                                cv=kfold_splitter,
                                n_jobs=-1,
                                random_state=42,
                                scoring='r2')

svr_random.fit(X_train, y_train)
print(svr_random.best_params_)
Error:
ValueError: Invalid parameter kernel for estimator MultiOutputRegressor(estimator=SVR()). Check the list of available parameters with `estimator.get_params().keys()`.
After getting the optimum parameters:
SVR_model = svr_regr(kernel='rbf', C=10,
                     coef0=0.01, degree=3,
                     gamma='auto', tol=1e-6, random_state=42)
SVR_model.fit(X_train, y_train)
SVR_model_y_predict = SVR_model.predict(X_test)
SVR_model_y_predict
Error after getting the optimum parameters:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/mm/r4gnnwl948zclfyx12w803040000gn/T/ipykernel_96269/769104914.py in <module>
----> 1 SVR_model = svr_regr (estimator__kernel='rbf',estimator__C=10,
2 estimator__coef0=0.01,estimator__degree=3,
3 estimator__gamma='auto',estimator__tol=1e-6,random_state=42)
4
5
TypeError: 'MultiOutputRegressor' object is not callable

I tried to reproduce a simple example of MultiOutputRegressor without the hyper-parameter search (i.e. just the fit and predict methods), and it worked fine. The error message:
Check the list of available parameters with estimator.get_params().keys()
suggests that the parameters you are optimising through param_distributions don't match the parameters accepted by MultiOutputRegressor. Looking at the API reference, MultiOutputRegressor itself takes only a few parameters, and the ones you are trying to tune, e.g. C and tol, belong to the wrapped support vector machine estimator.
You may be able to pass parameters through to SVR via nested parameter names, similar to how it's done in a pipeline.
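For example, a minimal sketch of that nested syntax, prefixing each SVR parameter with estimator__ (X_train and y_train are assumed to be defined as in the question):

from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RandomizedSearchCV, KFold

svr_regr = MultiOutputRegressor(SVR())

# The prefix routes each parameter to the wrapped SVR (whose parameter name
# inside MultiOutputRegressor is `estimator`) instead of to the
# MultiOutputRegressor itself.
param_distributions = {'estimator__kernel': ('linear', 'poly', 'rbf', 'sigmoid'),
                       'estimator__C': [1, 2, 5, 10],
                       'estimator__gamma': ('auto', 'scale'),
                       'estimator__tol': [1e-3, 1e-4, 1e-5, 1e-6]}

svr_random = RandomizedSearchCV(svr_regr,
                                param_distributions=param_distributions,
                                n_iter=20,
                                cv=KFold(n_splits=6, shuffle=True, random_state=0),
                                n_jobs=-1,
                                random_state=42,
                                scoring='r2')
# svr_random.fit(X_train, y_train)
# print(svr_random.best_params_)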

Related

Error while doing SVR for multiple outputs

Trying to do SVR for multiple outputs. I started with hyper-parameter tuning, which worked for me. Now I want to create the model using the optimum parameters, but I am getting an error. How can I fix this?
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multioutput import MultiOutputRegressor

svr = SVR()
svr_regr = MultiOutputRegressor(svr)

kfold_splitter = KFold(n_splits=6, random_state=0, shuffle=True)

svr_gs = GridSearchCV(svr_regr,
                      param_grid={'estimator__kernel': ('linear', 'poly', 'rbf', 'sigmoid'),
                                  'estimator__C': [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10],
                                  'estimator__degree': [3, 8],
                                  'estimator__coef0': [0.01, 0.1, 0.5],
                                  'estimator__gamma': ('auto', 'scale'),
                                  'estimator__tol': [1e-3, 1e-4, 1e-5, 1e-6]},
                      cv=kfold_splitter,
                      n_jobs=-1,
                      scoring='r2')

svr_gs.fit(X_train, y_train)
print(svr_gs.best_params_)
#print(svr_gs.best_score_)
Output:
{'estimator__C': 10, 'estimator__coef0': 0.01, 'estimator__degree': 3, 'estimator__gamma': 'auto', 'estimator__kernel': 'rbf', 'estimator__tol': 1e-06}
Trying to create a model using the output:
SVR_model = svr_regr(kernel='rbf', C=10,
                     coef0=0.01, degree=3,
                     gamma='auto', tol=1e-6, random_state=42)
SVR_model.fit(X_train, y_train)
SVR_model_y_predict = SVR_model.predict(X_test)
SVR_model_y_predict
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/mm/r4gnnwl948zclfyx12w803040000gn/T/ipykernel_96269/769104914.py in <module>
----> 1 SVR_model = svr_regr (estimator__kernel='rbf',estimator__C=10,
2 estimator__coef0=0.01,estimator__degree=3,
3 estimator__gamma='auto',estimator__tol=1e-6,random_state=42)
4
5
TypeError: 'MultiOutputRegressor' object is not callable
Please consult the MultiOutputRegressor docs. The regressor you got back is the model. It is not a method, but it does offer a bunch of methods that you can call, such as .fit(), .predict(), and .score(). You are trying to specify kernel and a few other parameters; it appears you wanted to pass those to SVR(), at the top of your code.
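For example, a minimal sketch using the best parameters reported above. Note that neither SVR nor MultiOutputRegressor accepts random_state, so it is dropped here; X_train, y_train and X_test are assumed from the question:

from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

# Option 1: build a fresh model with the tuned SVR parameters.
SVR_model = MultiOutputRegressor(SVR(kernel='rbf', C=10, coef0=0.01,
                                     degree=3, gamma='auto', tol=1e-6))
SVR_model.fit(X_train, y_train)
SVR_model_y_predict = SVR_model.predict(X_test)

# Option 2: since GridSearchCV refits on the whole training set by default,
# the tuned model is already available as svr_gs.best_estimator_.
# SVR_model_y_predict = svr_gs.best_estimator_.predict(X_test)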

Passing parameters to a pipeline's fit() in sklearn

I have a sklearn pipeline with PolynomialFeatures() and LinearRegression() in series. My aim is to fit data to this using different degrees of the polynomial features and measure the score. The following is the code I use:
steps = [('polynomials', preprocessing.PolynomialFeatures()), ('linreg', linear_model.LinearRegression())]
pipeline = pipeline.Pipeline(steps=steps)

scores = dict()
for i in range(2, 6):
    params = {'polynomials__degree': i, 'polynomials__include_bias': False}
    #pipeline.set_params(**params)
    pipeline.fit(X_train, y=yCO_logTrain, **params)
    scores[i] = pipeline.score(X_train, yCO_logTrain)
scores
I receive the error - TypeError: fit() got an unexpected keyword argument 'degree'.
Why is this error thrown even though the parameters are named in the format <estimator_name>__<parameter_name>?
As per the sklearn.pipeline.Pipeline documentation:

**fit_params : dict of string -> object
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.

This means that parameters passed this way go directly to step s's .fit() method. If you check the PolynomialFeatures documentation, the degree argument is used in the construction of the PolynomialFeatures object, not in its .fit() method.
If you want to try different hyperparameters for estimators/transformers within a pipeline, you can use GridSearchCV as shown here. Here's an example from the link:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# calibrated_forest as defined in the linked docs example
calibrated_forest = CalibratedClassifierCV(
    base_estimator=RandomForestClassifier(n_estimators=10))

pipe = Pipeline([
    ('select', SelectKBest()),
    ('model', calibrated_forest)])
param_grid = {
    'select__k': [1, 2],
    'model__base_estimator__max_depth': [2, 4, 6, 8]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
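Alternatively, the set_params call you commented out is the right mechanism for the loop in your question; a minimal sketch reusing your variable names:

# set_params(**params) changes constructor parameters of the named steps,
# while fit(**params) only forwards arguments to each step's fit() method.
scores = dict()
for i in range(2, 6):
    pipeline.set_params(polynomials__degree=i, polynomials__include_bias=False)
    pipeline.fit(X_train, y=yCO_logTrain)
    scores[i] = pipeline.score(X_train, yCO_logTrain)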

TypeError: __init__() got multiple values for argument 'n_splits'

I'm using scikit-learn version 0.20.2 with the following:
from sklearn.model_selection import StratifiedKFold

grid = GridSearchCV(
    pipeline,  # pipeline from above
    params,  # parameters to tune via cross validation
    refit=True,  # fit using all available data at the end, on the best found param combination
    scoring='accuracy',  # what score are we optimizing?
    cv=StratifiedKFold(label_train, n_splits=5),  # what type of cross validation to use
)
But I don't understand why I get this error:
TypeError Traceback (most recent call last)
<ipython-input-26-03a56044cb82> in <module>()
10 refit=True, # fit using all available data at the end, on the best found param combination
11 scoring='accuracy', # what score are we optimizing?
---> 12 cv=StratifiedKFold(label_train, n_splits=5), # what type of cross validation to use
13 )
TypeError: __init__() got multiple values for argument 'n_splits'
I already tried n_fold but got the same error. I also tried updating my scikit-learn version and my conda. Any idea how to fix this? Thanks a lot!
StratifiedKFold takes exactly 3 arguments when initialized, none of which are the training data:

StratifiedKFold(n_splits='warn', shuffle=False, random_state=None)

So when you call StratifiedKFold(label_train, n_splits=5) it thinks you passed n_splits twice.
Instead, create the object, then use its methods as described in the example on the sklearn docs page to split your data:

get_n_splits([X, y, groups]): Returns the number of splitting iterations in the cross-validator.
split(X, y[, groups]): Generate indices to split data into training and test set.
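For instance, a minimal sketch (X_train and label_train stand in for the question's data):

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)  # no data in the constructor
for train_idx, test_idx in skf.split(X_train, label_train):
    # each iteration yields the row indices of one train/test fold
    print(len(train_idx), len(test_idx))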
StratifiedKFold does not take the data in its constructor, so passing label_train as the first positional argument collides with the n_splits keyword; see more in the sklearn documentation.
Create the StratifiedKFold object first and pass it to GridSearchCV as below.
skf = StratifiedKFold(n_splits=5)
skf.get_n_splits(X_train, Y_train)

grid = GridSearchCV(
    pipeline,  # pipeline from above
    params,  # parameters to tune via cross validation
    refit=True,  # fit using all available data at the end, on the best found param combination
    scoring='accuracy',  # what score are we optimizing?
    cv=skf,  # what type of cross validation to use
)

Probabilistic SVM, regression

I've currently implemented a probabilistic SVM (at least I think so) for binary classes. Now I want to extend this approach to regression, and I'm trying to use it on the Boston dataset. Unfortunately, my algorithm seems to be stuck; the code I'm currently running looks like this:
from sklearn import decomposition
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data
y = boston.target
inputs_train, inputs_test, targets_train, targets_test = train_test_split(X, y, test_size=0.33, random_state=42)

def plotting():
    param_C = [0.01, 0.1]
    param_grid = {'C': param_C, 'kernel': ['poly', 'rbf'], 'gamma': [0.1, 0.01]}
    clf = GridSearchCV(svm.SVR(), cv=5, param_grid=param_grid)
    clf.fit(inputs_train, targets_train)
    clf = SVR(C=clf.best_params_['C'], cache_size=200, class_weight=None, coef0=0.0,
              decision_function_shape='ovr', degree=5, gamma=clf.best_params_['gamma'],
              kernel=clf.best_params_['kernel'],
              max_iter=-1, probability=True, random_state=None, shrinking=True,
              tol=0.001, verbose=False)
    clf.fit(inputs_train, targets_train)
    a = clf.predict(inputs_test[0])
    print(a)

plotting()
Can someone tell me what is wrong with this approach? It's not that I get an error message (I know, I've suppressed warnings above), but the code never stops running. Any suggestion is hugely appreciated.
There are several issues with your code.
To start with, what is taking forever is the first clf.fit (i.e. the grid search one), and that's why you didn't see any change when you set max_iter and tol in your second clf.fit.
Second, the clf = SVR() part will not work, because:

- You have to import it; as written, the name SVR is not defined.
- You have a bunch of illegal arguments in there (decision_function_shape, probability, random_state etc.); check the docs for the admissible SVR arguments.
Third, you don't need to explicitly fit again with the best parameters; you should simply ask for refit=True in your GridSearchCV definition and subsequently use clf.best_estimator_ for your predictions (EDIT after comment: simply clf.predict will also work).
So, moving the stuff outside of any function definition, here is a working version of your code:

from sklearn.svm import SVR
# other imports as-is

# data loading & splitting as-is

param_C = [0.01, 0.1]
param_grid = {'C': param_C, 'kernel': ['poly', 'rbf'], 'gamma': [0.1, 0.01]}
clf = GridSearchCV(SVR(degree=5, max_iter=10000), cv=5, param_grid=param_grid, refit=True)
clf.fit(inputs_train, targets_train)

a = clf.best_estimator_.predict(inputs_test[0])
# a = clf.predict(inputs_test[0]) will also work
print(a)
# [ 21.89849792]
Apart from degree, all the other admissible argument values you are using are actually the respective defaults, so the only arguments you really need in your SVR definition are degree and max_iter.
You'll get a couple of warnings (not errors), i.e. after fitting:

/databricks/python/lib/python3.5/site-packages/sklearn/svm/base.py:220: ConvergenceWarning: Solver terminated early (max_iter=10000). Consider pre-processing your data with StandardScaler or MinMaxScaler.

and after predicting:

/databricks/python/lib/python3.5/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
which already contain some advice for what to do next...
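For instance, to act on the ConvergenceWarning you could scale the features before the SVR inside a pipeline; a minimal sketch (not part of the tested code above), reusing the question's data variables:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('scale', StandardScaler()),
                 ('svr', SVR(degree=5, max_iter=10000))])
# step-prefixed keys route the parameters to the SVR step
param_grid = {'svr__C': [0.01, 0.1],
              'svr__kernel': ['poly', 'rbf'],
              'svr__gamma': [0.1, 0.01]}
clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, refit=True)
clf.fit(inputs_train, targets_train)

# reshape a single sample to 2-D, as the DeprecationWarning advises
a = clf.predict(inputs_test[0].reshape(1, -1))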
Last but not least: a probabilistic classifier (i.e. one that produces probabilities instead of hard labels) is a valid thing, but a "probabilistic" regression model is not...
Tested with Python 3.5 and scikit-learn 0.18.1

Using GridSearchCV with AdaBoost and DecisionTreeClassifier

I am attempting to tune an AdaBoost Classifier ("ABT") using a DecisionTreeClassifier ("DTC") as the base_estimator. I would like to tune both ABT and DTC parameters simultaneously, but am not sure how to accomplish this - pipeline shouldn't work, as I am not "piping" the output of DTC to ABT. The idea would be to iterate hyper parameters for ABT and DTC in the GridSearchCV estimator.
How can I specify the tuning parameters correctly?
I tried the following, which generated an error below.
[IN]

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV

param_grid = {dtc__criterion: ["gini", "entropy"],
              dtc__splitter: ["best", "random"],
              abc__n_estimators: [none, 1, 2]
             }

DTC = DecisionTreeClassifier(random_state=11, max_features="auto", class_weight="auto", max_depth=None)
ABC = AdaBoostClassifier(base_estimator=DTC)

# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring='roc_auc')
[OUT]
ValueError: Invalid parameter dtc for estimator AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=DecisionTreeClassifier(class_weight='auto', criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
random_state=11, splitter='best'),
learning_rate=1.0, n_estimators=50, random_state=11)
There are several things wrong in the code you posted:
The keys of the param_grid dictionary need to be strings. You should be getting a NameError.
The key "abc__n_estimators" should just be "n_estimators": you are probably mixing this with the pipeline syntax. Here nothing tells Python that the string "abc" represents your AdaBoostClassifier.
None (and not none) is not a valid value for n_estimators. The default value (probably what you meant) is 50.
Here's the code with these fixes.
To set the parameters of your Tree estimator you can use the "__" syntax that allows accessing nested parameters.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.grid_search import GridSearchCV

param_grid = {"base_estimator__criterion": ["gini", "entropy"],
              "base_estimator__splitter": ["best", "random"],
              "n_estimators": [1, 2]
             }

DTC = DecisionTreeClassifier(random_state=11, max_features="auto", class_weight="auto", max_depth=None)
ABC = AdaBoostClassifier(base_estimator=DTC)

# run grid search
grid_search_ABC = GridSearchCV(ABC, param_grid=param_grid, scoring='roc_auc')
Also, 1 or 2 estimators does not really make sense for AdaBoost. But I'm guessing this is not the actual code you're running.
Hope this helps.
Trying to provide a shorter (and hopefully generic) answer.
If you want to grid search within the base_estimator of AdaBoostClassifier, e.g. varying the max_depth or min_samples_leaf of a DecisionTreeClassifier estimator, then you have to use a special syntax in the parameter grid.
abc = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())

parameters = {'base_estimator__max_depth': [i for i in range(2, 11, 2)],
              'base_estimator__min_samples_leaf': [5, 10],
              'n_estimators': [10, 50, 250, 1000],
              'learning_rate': [0.01, 0.1]}

clf = GridSearchCV(abc, parameters, verbose=3, scoring='f1', n_jobs=-1)
clf.fit(X_train, y_train)
So, note the 'base_estimator__max_depth' and 'base_estimator__min_samples_leaf' keys in the parameters dictionary. That's the way to access the hyperparameters of the base estimator of an ensemble algorithm like AdaBoostClassifier when you are doing a grid search; note the double underscore (__) notation in particular. The other two keys in parameters are regular AdaBoostClassifier parameters.
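One caveat not in the original answers: in newer scikit-learn releases the base_estimator argument of AdaBoostClassifier was renamed to estimator (deprecated in 1.2, removed in 1.4), and sklearn.grid_search was replaced by sklearn.model_selection long ago. On such versions the same idea looks like this:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# `estimator` replaces `base_estimator` in scikit-learn >= 1.2,
# so the nested grid keys change accordingly.
abc = AdaBoostClassifier(estimator=DecisionTreeClassifier())
parameters = {'estimator__max_depth': [2, 4, 6],
              'estimator__min_samples_leaf': [5, 10],
              'n_estimators': [50, 250],
              'learning_rate': [0.01, 0.1]}
clf = GridSearchCV(abc, parameters, scoring='f1', n_jobs=-1)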
