FeatureUnion: sklearn FeatureUnion does not allow fit params - python

FeatureUnion is not able to fit. The last line of the following code, the call to fit(), throws the error shown below:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
iris = load_iris()
X, y = iris.data, iris.target
pca = PCA(n_components=2)
selection = SelectKBest(k=1)
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
svm = SVC(kernel="linear")
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
pipeline.fit(X, y, features__univ_select__k=2)
Error Thrown:
TypeError: fit_transform() got an unexpected keyword argument 'univ_select__k'

The argument features__univ_select__k=2 is not accepted by fit; and in any case, it is simply not needed here.
If you are following the feature union example in the scikit-learn docs (as you seem to be), you should notice that there it is used as an argument in param_grid, and not in fit.
But here you don't perform any parameter grid search; and since you have already defined that
pca = PCA(n_components=2)
selection = SelectKBest(k=1)
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
there is no other "choice" for the number of features to be used, so you should not pass any features__univ_select__k=2 argument. Simply calling
pipeline.fit(X, y)
does the job:
Pipeline(memory=None,
steps=[('features', FeatureUnion(n_jobs=None,
transformer_list=[('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('univ_select', SelectKBest(k=1, score_func=<function f_classif at 0x000000000808AD08>))],
tran...r', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
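If you do want to try several values of k (and of n_components), that is where the double-underscore parameter names belong: in a param_grid passed to GridSearchCV, as in the docs example. A minimal sketch reusing the pipeline defined above (the particular grid values are just illustrative):
from sklearn.model_selection import GridSearchCV

param_grid = {
    "features__pca__n_components": [1, 2, 3],
    "features__univ_select__k": [1, 2],
    "svm__C": [0.1, 1, 10],
}
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X, y)           # the k=2 variant is tried here as one of the grid candidates
print(grid_search.best_params_)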


sklearn.cross_validate returns one unfitted estimator

I used sklearn.model_selection.cross_validate for cross-validation of an sklearn.pipeline.Pipeline, which works great.
Now I am interested in the coefficients of a feature selection step in the pipeline. The selector used is SelectFromModel(LinearSVC(penalty="l1", dual=False)).
By setting return_estimator=True the cross-validation method should return the estimators fitted on each split. This works well for the classifier:
>>> pipeline[-1].coef_
[ 0.20973553 0.48124347 -0.27811877 ... ]
However, when I inspect the feature selection step, an attribute error is raised, as the object is not yet fitted:
>>> output = cross_validate(pipeline, X, y, cv=skf.split(X, data.cohort_idx), return_estimator=True)
>>> output['estimator'][1][-2].estimator.coef_
AttributeError: 'LinearSVC' object has no attribute 'coef_'
Fitting this step afterwards solves the issue, but doing so would be cumbersome and error-prone within the cross-validation process:
>>> pipeline.fit(X, y)
>>> pipeline[3].estimator.coef_
[-0.27501591 0.14398988 0.83767175 ... ]
How do I get the cross_validate to return a fitted feature selector?
You can replicate this example using:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
# Make dummy data
X = np.random.rand(50, 4)
y = np.random.choice([True, False], 50)
# Make pipeline
selector = SelectFromModel(LinearSVC(penalty="l1", dual=False))
classifier = LogisticRegression()
pipeline = make_pipeline(selector, classifier)
# Cross validate
output = cross_validate(pipeline, X, y, return_estimator=True)
# Print coefficients
print('Classifier coef_:', output['estimator'][0][1].coef_)
print('Selector coef_: ', output['estimator'][0][0].estimator.coef_)
After finalizing this question, and just before submitting it, I realized I should access the member estimator_ and not estimator.
I hope this helps someone one day.
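Concretely (a small sketch based on that fix), the last print in the snippet above becomes:
# SelectFromModel stores the fitted inner model as `estimator_` (note the trailing underscore)
print('Selector coef_: ', output['estimator'][0][0].estimator_.coef_)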

Problem passing Pipeline through nested RFECV and GridSearchCV

I'm trying to perform feature selection and grid search for the inner loop of a nested CV in sklearn. While I can pass a pipeline as the estimator to the RFECV, I then receive an error on fitting when I pass the RFECV as the estimator to GridSearchCV.
I have found that changing the name of the model step in the pipeline to 'estimator' moves the error to the Pipeline ('regressor' reported as an invalid parameter) rather than the RFECV, where whatever I had named the model step was reported as an invalid parameter.
I have verified using rfcv.get_params().keys() and pipeline.get_params().keys() that the parameters I'm calling do exist.
I do not receive this error if I name SGDRegressor() directly as ‘estimator’ and ignore the pipeline entirely, but this model requires feature scaling and log transforming of the Y variable.
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import SGDRegressor
import numpy as np
# random sample data
X = np.random.rand(100, 2)
y = np.random.rand(100)
# pass coef_ and feature_importances_ through the pipeline and TransformedTargetRegressor
class MyPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
class MyTransformedTargetRegressor(TransformedTargetRegressor):
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_
    @property
    def coef_(self):
        return self.regressor_.coef_
# build pipeline
pipeline = MyPipeline([('scaler', MinMaxScaler()),
                       ('estimator', MyTransformedTargetRegressor(regressor=SGDRegressor(), func=np.log1p, inverse_func=np.expm1))])
# define tuning grid
parameters = {"estimator__regressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
              "estimator__regressor__l1_ratio": [0.001, 0.25, 0.5, 0.75, 0.999]}
# instantiate inner cv
inner_kv = KFold(n_splits=5, shuffle=True, random_state=42)
rfcv = RFECV(estimator=pipeline, step=1, cv=inner_kv, scoring="neg_mean_squared_error")
cv = GridSearchCV(estimator=rfcv, param_grid=parameters, cv=inner_kv, iid=True,
                  scoring="neg_mean_squared_error", n_jobs=-1, verbose=True)
cv.fit(X, y)
I receive the following error, although I can confirm that regressor is a parameter of the pipeline's 'estimator' step:
ValueError: Invalid parameter regressor for estimator MyPipeline(memory=None,
steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
('estimator',
MyTransformedTargetRegressor(check_inverse=True,
func=<ufunc 'log1p'>,
inverse_func=<ufunc 'expm1'>,
regressor=SGDRegressor(alpha=0.0001,
average=False,
early_stopping=False,
epsilon=0.1,
eta0=0.01,
fit_intercept=True,
l1_ratio=0.15,
learning_rate='invscaling',
loss='squared_loss',
max_iter=1000,
n_iter_no_change=5,
penalty='l2',
power_t=0.25,
random_state=None,
shuffle=True,
tol=0.001,
validation_fraction=0.1,
verbose=0,
warm_start=False),
transformer=None))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
Thanks
It has to be estimator__estimator__regressor__... because the pipeline sits inside the RFECV: the grid keys are resolved relative to the estimator you pass to GridSearchCV (the RFECV), whose own estimator parameter is the pipeline.
Try this!
parameters = {"estimator__estimator__regressor__alpha": [1e-5,1e-4,1e-3,1e-2,1e-1],
"estimator__estimator__regressor__l1_ratio": [0.001,0.25,0.5,0.75,0.999]}
Note: Having a nested CV like this is not the right approach. Maybe you can do the feature selection separately and then do the model training.

Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error

I am trying to chain Grid Search and Recursive Feature Elimination in a Pipeline using scikit-learn.
GridSearchCV and RFE with "bare" classifier works fine:
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
param_grid = dict(estimator__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)
Putting the classifier in a pipeline returns an error: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)
selector = feature_selection.RFE(pipe)
param_grid = dict(estimator__clf__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)
EDIT:
I have realised that I was not clear in describing the problem. Here is a clearer snippet:
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# This will work
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10]})
clf.fit(X, y)
# This will not work
est = pipeline.make_pipeline(SVR(kernel="linear"))
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10]})
clf.fit(X, y)
As you can see, the only difference is putting the estimator in a pipeline. The Pipeline, however, hides the "coef_" and "feature_importances_" attributes. The questions are:
Is there a nice way of dealing with this in scikit-learn?
If not, is this behaviour desired for any reason?
EDIT2:
Updated, working snippet based on the answer provided by @Chris
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
class MyPipe(pipeline.Pipeline):
    def fit(self, X, y=None, **fit_params):
        """Fit the pipeline, then expose the last element's coef_ attribute.
        Based on the source code for decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        """
        super(MyPipe, self).fit(X, y, **fit_params)
        self.coef_ = self.steps[-1][-1].coef_
        return self
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# Without Pipeline
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)
# With Pipeline
est = MyPipe([('svr', SVR(kernel="linear"))])
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)
You have an issue with your use of pipeline.
A pipeline works as follows:
the first object is applied to the data when you call .fit(X, y); if that object exposes a .transform() method, the transform is applied and its output is used as the input for the next stage.
A pipeline can have any valid model as a final object, but all previous ones MUST expose a .transform() method.
Just like a pipe - you feed in data and each object in the pipeline takes the previous output and does another transform on it.
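As a rough, self-contained illustration (my own sketch, not part of the original answer), Pipeline.fit behaves conceptually like this:
from sklearn.datasets import make_friedman1
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVR(kernel="linear"))])

# conceptually, pipe.fit(X, y) does roughly the following:
Xt = X
for name, step in pipe.steps[:-1]:
    Xt = step.fit_transform(Xt, y)   # every intermediate step is fit and then transforms the data
pipe.steps[-1][1].fit(Xt, y)         # the final estimator is only fit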
As we can see,
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.fit_transform
RFE exposes a transform method, and so should be included in the pipeline itself.
E.g.
some_sklearn_model = RandomForestClassifier()
selector = feature_selection.RFE(some_sklearn_model)
pipe_params = [('std_scaler', std_scaler), ('RFE', selector), ('clf', est)]
Your attempt has a few issues.
Firstly, you are trying to scale a slice of your data. Imagine I had two partitions [1,1], [10,10]. If I normalize by the mean of the partition I lose the information that my second partition is significantly above the mean. You should scale at the start, not in the middle.
Secondly, SVR does not implement a transform method, so you cannot use it as a non-final element in a pipeline.
RFE takes in a model which it fits to the data and then evaluates the weight of each feature.
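A quick way to verify this (my own addition, not from the answer): intermediate pipeline steps need a transform method, which RFE provides and SVR does not.
from sklearn.svm import SVR
from sklearn.feature_selection import RFE

print(hasattr(SVR(kernel="linear"), "transform"))        # False: SVR cannot be a non-final step
print(hasattr(RFE(SVR(kernel="linear")), "transform"))   # True: RFE can sit mid-pipeline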
EDIT:
You can include this behaviour if you wish by wrapping the sklearn pipeline in your own class. What we want to do is, when we fit the data, retrieve the last estimator's .coef_ attribute and store it locally in our derived class under the correct name.
I suggest you look into the source code on github, as this is only a first start and more error checking etc. would probably be required. Sklearn uses a function decorator called @if_delegate_has_method which would be a handy thing to add to ensure the method generalises. I have run this code to make sure it runs, but nothing more.
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
class myPipe(pipeline.Pipeline):
    def fit(self, X, y):
        """Fit the pipeline, then expose the last element's coef_ attribute.
        Based on the source code for decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        """
        super(myPipe, self).fit(X, y)
        self.coef_ = self.steps[-1][-1].coef_
        return
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler),('select', selector), ('clf', est)]
pipe = myPipe(pipe_params)
selector = feature_selection.RFE(pipe)
clf = GridSearchCV(selector, param_grid={'estimator__clf__C': [2, 10]})
clf.fit(X, y)
print(clf.best_params_)
If anything is not clear, please ask.
I think you had a slightly different way of constructing the pipeline than what was listed in the pipeline documentation.
Are you looking for this?
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
std_scaler = preprocessing.StandardScaler()
selector = feature_selection.RFE(est)
pipe_params = [('feat_selection',selector),('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)
param_grid = dict(clf__C=[0.1, 1, 10])
clf = GridSearchCV(pipe, param_grid=param_grid, cv=2)
clf.fit(X, y)
print(clf.grid_scores_)
Also see this useful example for combining things in a pipeline. For the RFE object, I just used the official documentation for constructing it with your SVR estimator - I then just put the RFE object into the pipeline in the same way as you had done with the scaler and estimator objects.

When do items in the Pipeline call fit_transform(), and when do they call transform()? (scikit-learn, Pipeline)

I'm trying to fit a model that I've put together using Pipeline:
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
cross_validation_object = cross_validation.StratifiedKFold(Y, n_folds = 10)
scaler = MinMaxScaler(feature_range = [0,1])
logistic_fit = LogisticRegression()
pipeline_object = Pipeline([('scaler', scaler),('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01,0.1,1,10],
'model__penalty': ['l1','l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv = cross_validation_object, scoring = 'accuracy')
grid_search_object.fit(X_train,Y_train)
My question: Is the best_estimator going to scale the test data based on the values in the training data? For example, if I call:
grid_search_object.best_estimator_.predict(X_test)
It will NOT try to fit the scaler on the X_test data, right? It will just transform it using the original parameters.
Thanks!
The predict methods never fit any data. In this case, exactly as you describe it, the best_estimator_ pipeline is going to scale based on the scaling it has learnt on the training set.
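Here is a small standalone sketch (my own, not the answer's code) showing that the scaler's parameters are learnt during fit and merely reused at predict/transform time:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0], [10.0]])
y_train = np.array([0, 1])
X_test = np.array([[20.0]])

pipe = Pipeline([('scaler', MinMaxScaler(feature_range=(0, 1))), ('model', LogisticRegression())])
pipe.fit(X_train, y_train)

print(pipe.named_steps['scaler'].data_min_, pipe.named_steps['scaler'].data_max_)  # [0.] [10.] -- learnt from X_train
print(pipe.named_steps['scaler'].transform(X_test))  # [[2.]] -- scaled with the training min/max, not refit on X_test
print(pipe.predict(X_test))                          # predict uses the same already-fitted scaler internally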

ValueError using recursive feature elimination for SVM with rbf kernel in scikit-learn

I'm trying to use the recursive feature elimination (RFE) function in scikit-learn but keep getting the error ValueError: coef_ is only available when using a linear kernel. I am trying to perform feature selection for a support vector classifier (SVC) using a rbf kernel. This example from the website executes fine:
print(__doc__)
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.metrics import zero_one_loss
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
n_redundant=2, n_repeated=0, n_classes=8,
n_clusters_per_class=1, random_state=0)
# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
scoring='accuracy')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
# Plot number of features VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Number of features selected")
pl.ylabel("Cross validation score (nb of misclassifications)")
pl.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
pl.show()
However, simply changing the kernel type from linear to rbf, as follows, produces the error:
print(__doc__)
from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.metrics import zero_one_loss
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
n_redundant=2, n_repeated=0, n_classes=8,
n_clusters_per_class=1, random_state=0)
# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="rbf")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
scoring='accuracy')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
# Plot number of features VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Number of features selected")
pl.ylabel("Cross validation score (nb of misclassifications)")
pl.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
pl.show()
This seems like it could be a bug, but if anyone could spot something I'm doing wrong that would be great. Also, I'm running python 2.7.6 with scikit-learn version 0.14.1.
Thanks for the help!
This seems to be the expected outcome. RFECV requires the estimator to have a coef_ attribute that signifies the feature importances:
estimator : object
A supervised learning estimator with a fit method that updates a coef_ attribute that holds the fitted parameters. Important features must correspond to high absolute values in the coef_ array.
By changing the kernel to RBF, the SVC is no longer linear and the coef_ attribute becomes unavailable, according to the documentation:
coef_
array, shape = [n_class-1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel.
The error is raised by SVC (source) when RFECV is trying to access coef_ when the kernel is not linear.
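A minimal standalone check of that behaviour (my own sketch; the exact exception type depends on the scikit-learn version):
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

print(SVC(kernel="linear").fit(X, y).coef_.shape)  # works: primal weights are available

try:
    SVC(kernel="rbf").fit(X, y).coef_
except Exception as exc:  # older releases raise ValueError, newer ones AttributeError
    print(type(exc).__name__, exc)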
