I'm trying to run RFECV on transformed data using scikit-learn.
To do that, I create a pipeline and pass it to RFECV. It works fine unless I include a ColumnTransformer as a pipeline step, in which case I get the following error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I have checked the answers to this question, but I'm not sure they are applicable here. The code is as follows:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

class CustomPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

X = pd.DataFrame({
    'col1': [i for i in range(100)],
    'col2': [i * 2 for i in range(100)],
})
y = pd.DataFrame({'out': [i * 3 for i in range(100)]})

ct = ColumnTransformer([("norm", Normalizer(norm='l1'), ['col1'])])

pipe = CustomPipeline([
    ('col_transform', ct),
    ('lr', LinearRegression())
])

rfecv = RFECV(
    estimator=pipe,
    step=1,
    cv=3,
)

# pipe.fit(X, y)  # the pipe alone can fit, no problems
rfecv.fit(X, y)
Obviously, I can apply the transformation step outside the pipeline and then use the transformed X, but I was wondering whether there is any workaround.
I'd also like to raise this as a design issue in RFECV: it converts X to a numpy array as its very first step, while other estimators with built-in cross-validation, e.g. GridSearchCV, do not.
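For reference, a minimal sketch of that workaround. Note that remainder='passthrough' is my own addition here so that col2 survives the transformation (the ColumnTransformer above silently drops it); adjust as needed:

ct = ColumnTransformer(
    [("norm", Normalizer(norm='l1'), ['col1'])],
    remainder='passthrough',
)
X_transformed = ct.fit_transform(X)  # a plain ndarray, no column names left

# RFECV now sees numeric data only, so the string-column issue disappears
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=3)
rfecv.fit(X_transformed, y.values.ravel())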
I used sklearn.model_selection.cross_validate for cross-validation of an sklearn.pipeline.Pipeline, which works great.
Now I am interested in the coefficients of a feature selection step in the pipeline. The selector used is SelectFromModel(LinearSVC(penalty="l1", dual=False)).
By setting return_estimator=True, the cross-validation method returns the estimators fitted on each split. This works well for the classifier:
>>> pipeline[-1].coef_
[ 0.20973553 0.48124347 -0.27811877 ... ]
However, when I inspect the feature selection step, an AttributeError is raised, as if the object were not fitted:
>>> output = cross_validate(pipeline, X, y, cv=skf.split(X, data.cohort_idx), return_estimator=True)
>>> output['estimator'][1][-2].estimator.coef_
AttributeError: 'LinearSVC' object has no attribute 'coef_'
Fitting this step afterwards solves the issue, but doing so would be cumbersome and error-prone inside the cross-validation process:
>>> pipeline.fit(X, y)
>>> pipeline[3].estimator.coef_
[-0.27501591 0.14398988 0.83767175 ... ]
How do I get cross_validate to return a fitted feature selector?
You can replicate this example using:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
# Make dummy data
X = np.random.rand(50, 4)
y = np.random.choice([True, False], 50)
# Make pipeline
selector = SelectFromModel(LinearSVC(penalty="l1", dual=False))
classifier = LogisticRegression()
pipeline = make_pipeline(selector, classifier)
# Cross validate
output = cross_validate(pipeline, X, y, return_estimator=True)
# Print coefficients
print('Classifier coef_:', output['estimator'][0][1].coef_)
# the next line raises the AttributeError described above
print('Selector coef_: ', output['estimator'][0][0].estimator.coef_)
After finalizing this question, just before submitting it, I realized I should access the member estimator_, not estimator.
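For completeness, the working access on the replication example above looks like this:

# estimator_ (trailing underscore) is the inner LinearSVC that
# SelectFromModel fitted; estimator is the unfitted template passed in
print('Selector coef_: ', output['estimator'][0][0].estimator_.coef_)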
I hope this helps someone one day.
I wanted to test different scalers in my pipeline, so I created a class that takes the scaler I want as a parameter (StandardScaler, MinMaxScaler, etc.).
It works fine, but I also want to specify in my param_grid a configuration without any scaler. I added 'passthrough' to my pipeline estimators, but it doesn't work and I get the following error:
AttributeError: 'str' object has no attribute 'set_params'
I know it probably has something to do with the construction of my class, but I can't find how to fix it.
Here is the code you can run to reproduce it:
# import dependencies
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.datasets import load_breast_cancer
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class ScalerSelector(BaseEstimator, TransformerMixin):
    def __init__(self, scaler=StandardScaler()):
        super().__init__()
        self.scaler = scaler

    def fit(self, X, y=None):
        self.scaler.fit(X)
        return self  # return self so the transformer chains correctly

    def transform(self, X, y=None):
        return self.scaler.transform(X)

data = load_breast_cancer()
features = data["data"]
target = data["target"]
data = pd.DataFrame(data['data'], columns=data['feature_names'])
col_names = data.columns.tolist()

# scaler and encoder options
my_scaler = ScalerSelector()
preprocessor = ColumnTransformer(transformers=[('numerical', my_scaler, col_names)])

# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('log_reg', LogisticRegression())])

# set the param combinations I want to try
scaler_options = {'preprocessor': ['passthrough'],
                  'preprocessor__numerical__scaler': [StandardScaler(), RobustScaler(), MinMaxScaler()]}
# what I actually want to try: ['passthrough', ScalerSelector()]

# initialize GridSearchCV using full_pipeline as the final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid=scaler_options)

# fit the data -- this raises the AttributeError above
grid_cv.fit(data, target)

# best params:
grid_cv.best_params_
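One way around the error, sketched below on the assumption that the goal is simply "each scaler, plus no scaler at all": pass param_grid as a list of dicts. GridSearchCV expands each dict independently, so 'passthrough' is never combined with the nested scaler parameter that a plain string cannot accept:

scaler_options = [
    {'preprocessor__numerical__scaler': [StandardScaler(), RobustScaler(), MinMaxScaler()]},
    {'preprocessor': ['passthrough']},
]
grid_cv = GridSearchCV(full_pipeline, param_grid=scaler_options)
grid_cv.fit(data, target)
grid_cv.best_params_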
I have a problem where I am trying to apply transformations to my categorical feature 'country' and to the rest of my numerical columns. How can I do this? My attempt is below:
preprocess = make_column_transformer(
    (numeric_cols, make_pipeline(MinMaxScaler())),
    (categorical_cols, OneHotEncoder()))

model = make_pipeline(preprocess, XGBClassifier())
model.fit(X_train, y_train)
Note that numeric_cols is passed as a list, and so is categorical_cols.
However, I get this error: TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers., followed by the list of all my numerical columns and (type <class 'list'>) doesn't.
What am I doing wrong? Also, how can I deal with unseen categories in the column country?
You need to put the transformer first and the columns as the subsequent argument; if you check the help page, the signature is:
sklearn.compose.make_column_transformer(*transformers, **kwargs)
Something like the below will work:
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
import numpy as np
import pandas as pd

X = pd.DataFrame({'x1': np.random.uniform(0, 1, 5),
                  'x2': np.random.choice(['A', 'B'], 5)})
y = pd.Series(np.random.choice(['0', '1'], 5))

numeric_cols = X.select_dtypes('number').columns.to_list()
categorical_cols = X.select_dtypes('object').columns.to_list()

preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    (OneHotEncoder(), categorical_cols)
)

model = make_pipeline(preprocess, XGBClassifier())
model.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('minmaxscaler',
                                                  MinMaxScaler(), ['x1']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(), ['x2'])])),
                ('xgbclassifier', XGBClassifier())])
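As for the unseen categories in country: OneHotEncoder accepts handle_unknown='ignore', which encodes categories not seen during fit as all zeros at transform time instead of raising an error:

preprocess = make_column_transformer(
    (MinMaxScaler(), numeric_cols),
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols)
)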
I want to use sklearn.preprocessing.StandardScaler on a subset of pandas dataframe columns. Outside a pipeline this is trivial:
df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])
But now assume I have a column 'C' in df of type string, and the following pipeline definition:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('standard', StandardScaler())
])
df_scaled = pipeline.fit_transform(df)
How can I tell StandardScaler to only scale columns A and B?
I'm used to SparkML pipelines where the features to be scaled can be passed to the constructor of the scaler component:
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
Note: the features column contains a sparse vector with all the numerical feature columns, created by Spark's VectorAssembler.
You could check out sklearn-pandas, which offers an integration of pandas DataFrames and sklearn, e.g. with the DataFrameMapper:
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (list_of_columnnames, StandardScaler())
])
If you don't need external dependencies, you can use a simple transformer of your own, as I answered here:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        # select the named subset of DataFrame columns
        return X[self.names]

pipe = make_pipeline(Columns(names=list_of_columnnames), StandardScaler())
In plain sklearn, you'll need to use FunctionTransformer together with FeatureUnion. That is, your pipeline will look like:
pipeline = Pipeline([
    ('scale_sum', FeatureUnion(...))
])
where, within the FeatureUnion, one branch applies the standard scaler to some of the columns and the other passes the remaining columns through untouched.
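A minimal sketch of that idea, assuming df has numeric columns 'A' and 'B' to scale and a string column 'C' to pass through (names taken from the question):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# validate=False lets each FunctionTransformer receive the DataFrame as-is
select_ab = FunctionTransformer(lambda X: X[['A', 'B']], validate=False)
select_c = FunctionTransformer(lambda X: X[['C']], validate=False)

pipeline = Pipeline([
    ('scale_sum', FeatureUnion([
        ('scaled', Pipeline([('select', select_ab),
                             ('scale', StandardScaler())])),
        ('untouched', select_c),
    ])),
])
df_scaled = pipeline.fit_transform(df)  # returns an array, not a DataFrame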
Using Ibex (which I co-wrote precisely to make sklearn and pandas work better together), you could write it as follows:
from ibex.sklearn.preprocessing import StandardScaler
from ibex import trans
pipeline = (trans(StandardScaler(), in_cols=['A', 'B']) + trans(None, ['c', 'd'])) | <other pipeline steps>
I am trying to add a calibration step to a sklearn pipeline in order to obtain a calibrated classifier, and thus more trustworthy output probabilities.
So far I have clumsily tried to insert a 'calibration' step using CalibratedClassifierCV, along the lines of the following (silly example for reproducibility):
import sklearn.datasets
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.calibration import CalibratedClassifierCV

data = sklearn.datasets.fetch_20newsgroups(categories=['alt.atheism', 'sci.space'])
df = pd.DataFrame(data=np.c_[data['data'], data['target']])\
       .rename({0: 'text', 1: 'class'}, axis='columns')

my_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', SGDClassifier(loss='modified_huber')),
    ('calibrator', CalibratedClassifierCV(cv=5, method='isotonic'))
])

my_pipeline.fit(df['text'].values, df['class'].values)
but that doesn't work (at least not in this way). Does anyone have tips about how to properly do this?
The SGDClassifier object should go into the CalibratedClassifierCV's base_estimator argument.
Your code should probably look something like this:
my_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', CalibratedClassifierCV(base_estimator=SGDClassifier(loss='modified_huber'),
                                          cv=5, method='isotonic'))
])
CalibratedClassifierCV is a meta-estimator: it wraps the underlying classifier and runs the calibration cross-validation internally, so it replaces the classifier step rather than following it as a separate step.
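With that layout, the pipeline fits as usual and calibrated probabilities come out of predict_proba; a quick usage sketch on the question's data:

# fit the corrected pipeline, then ask for calibrated probabilities
my_pipeline.fit(df['text'].values, df['class'].values)
probabilities = my_pipeline.predict_proba(df['text'].values)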