Question about best practices with custom scikit-learn estimator - python

I need to make a custom sklearn-compatible estimator that does some feature engineering such that feature1 is replaced by the difference feature1 - feature2. I further need this estimator to have an inverse_transform method that undoes the differencing. Here is a minimal example of such an estimator:
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnDifferencing(BaseEstimator, TransformerMixin):
    """Calculate the difference between feature 1 and feature 2."""
    def __init__(self, feature1: int = 0, feature2: int = 1):
        self.feature1 = feature1
        self.feature2 = feature2

    def fit(self, data, y=None, **kwargs):
        return self

    def transform(self, data):
        """Replace feature1 with feature1 - feature2."""
        data[:, self.feature1] = data[:, self.feature1] - data[:, self.feature2]
        self.feature2_values_ = data[:, self.feature2]
        return data

    def inverse_transform(self, data):
        """Replace feature1 with feature1 + self.feature2_values_."""
        data[:, self.feature1] = data[:, self.feature1] + self.feature2_values_
        return data
My concern here is that the inverse_transform method requires that the transform method was previously called, and that no rows have been dropped or re-ordered by any processing in between the call to transform and inverse_transform. Is this good/acceptable practice? Is there maybe a cleaner way that I am missing for implementing this type of estimator?
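For comparison, here is a minimal stateless sketch of the same idea (my own assumption, not part of the code above): since feature2 is left untouched in the transformed output, inverse_transform could recover feature1 from the columns it is given, with no attribute saved during transform.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class StatelessColumnDifferencing(BaseEstimator, TransformerMixin):
    """Sketch: feature2 stays in the output, so the inverse needs no stored state."""
    def __init__(self, feature1: int = 0, feature2: int = 1):
        self.feature1 = feature1
        self.feature2 = feature2

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.array(X, copy=True)  # avoid mutating the caller's array
        X[:, self.feature1] -= X[:, self.feature2]
        return X

    def inverse_transform(self, X):
        X = np.array(X, copy=True)
        X[:, self.feature1] += X[:, self.feature2]
        return X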

Related

Estimator base class wrapping model and preprocessing pipeline

I'm trying to work out how to easily wrap existing estimators along with a preprocessing pipeline and a target encoder, essentially generalizing the idea behind scikit's TransformedTargetRegressor. I have a possible solution but I'm wondering if I'm missing any repercussions of the design that are not immediately obvious. The basic idea is this:
class BaseModel:
    Model = None

    def __init__(self, feature_encoder=None, target_encoder=None, **params):
        steps = [("features", feature_encoder)] if feature_encoder else []
        steps.append(("model", self.Model(**params)))
        self.pipe_ = Pipeline(steps=steps)
        self.target_encoder_ = target_encoder

    def get_params(self, deep=True):
        """Argument `deep` essentially differentiates the use of the resulting params dict.
        - `deep=True` is used in GridSearch etc. to know which parameters can be set with `set_params`.
        - `deep=False` is used by clone(), where the resulting keys must correspond to __init__ args.
        """
        if deep:
            return self.pipe_.get_params(deep=deep)
        params = {
            "feature_encoder": self.pipe_.named_steps["features"],
            "target_encoder": self.target_encoder_,
        }
        params.update(self.pipe_.named_steps["model"].get_params())
        return params

    def set_params(self, **kwargs):
        self.pipe_.set_params(**kwargs)
        return self

    def prepare_fit(self, X, y=None):
        """Encode target, determine model parameters dynamically depending on data etc."""
        ...

    def fit(self, X, y=None):
        y = self.prepare_fit(X, y)
        self.pipe_.fit(X, y)
        ...
        return self

    def predict(self, X):
        """Predict and decode (inverse transform) target"""
        yp = self.pipe_.predict(X)
        ...
And so wrapping CatBoost, for example, would simply be:
class CatboostClassifier(BaseModel):
    Model = catboost.CatBoostClassifier
The crucial part is getting get_params and set_params right such that the wrapped models play nice with scikit's grid search etc., even though the design doesn't follow the official guidelines of having model attributes match its __init__ args.
It seems to work, in the sense that getting and setting any of the model or preprocessing pipeline's parameters works, as does cloning:
model = CatboostClassifier(verbose=50, iterations=100, feature_encoder=..., target_encoder=...)
params = {
    "model__iterations": 95,
    "features__datetime__encode__components": ["year", "month"],
}
print("Params before:", {k: v for k, v in model.get_params().items() if k in params})
cloned = clone(clone(model).set_params(**params))
print("Params after:", {k: v for k, v in cloned.get_params().items() if k in params})
>> Params before: {'features__datetime__encode__components': ['year'], 'model__iterations': 100}
>> Params after: {'features__datetime__encode__components': ['year', 'month'], 'model__iterations': 95}
And GridSearchCV also seems to work:
params = {
    "model__iterations": [100, 200],
    "model__learning_rate": [0.2, 0.3],
}
search = GridSearchCV(model, param_grid=params, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_estimator_.get_params()["model__iterations"])
>> {'model__iterations': 100, 'model__learning_rate': 0.2}
>> 100
So... my question is whether there is any other way this design may go wrong, e.g. pending deprecations or planned changes that would break this use of get_params and set_params. From what I can see, the only trick necessary to have an arbitrary constructor like this (breaking the official developer guidelines) is that get_params(deep=True) should return all settable parameters, while get_params(deep=False) is used only by base.clone() and needs to return any and all parameters necessary to call the constructor and make a (correct) copy.
I know this is a rather long "question", but I'd be grateful to know about any caveats I should be aware of regarding the proposed pattern.
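For reference, a rough sketch of what base.clone() does with the deep=False parameters (simplified; the real implementation also verifies that the reconstructed copy carries the same parameters):
from copy import deepcopy

def clone_sketch(estimator):
    # clone() asks only for constructor-level parameters ...
    params = estimator.get_params(deep=False)
    # ... copies them, and rebuilds the estimator from its class.
    new_params = {name: deepcopy(value) for name, value in params.items()}
    return type(estimator)(**new_params)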

Cannot perform reduce with flexible type in sklearn pipeline

I'm trying to implement an sklearn pipeline; my code is as follows. It's the tips dataset (https://www.kaggle.com/jsphyg/tipping). I'm trying to label-encode the binary features, one-hot encode the day column, and scale the total column. Below you can find one of my classes (the other two have almost the same structure, so I won't post them; I get the same error with them as with this one).
class onehotencode(BaseEstimator, TransformerMixin):
    def __init__(self, column=None):
        for column in cols_to_encode:
            self.column = column
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        encoder = OneHotEncoder()
        return encoder.fit_transform(X[self.column])

class labelencode(BaseEstimator, TransformerMixin):
    def __init__(self, column=None):
        for column in cols_to_encode_label:
            self.column = column
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        encoder = LabelEncoder()
        return encoder.fit_transform(X[self.column])

pipeline = Pipeline([('ohe', onehotencode()),
                     ('le', labelencode()),
                     ('scaler', scaler())])
df_transformed = pipeline.fit_transform(df)
When I try to fit to pipeline I get the following error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I tried to change the transform as follows:
def transform(self, X):
    encoder = LabelEncoder()
    return encoder.fit_transform(X[[self.column]])
When I do, I get the following error:
cannot perform reduce with flexible type
Can anyone help me? I really did search for the above errors but couldn't fix them.
Thanks.
The first problem I observed is the __init__ method in your class:
def __init__(self, column=None):
    for column in cols_to_encode:
        self.column = column
As I understand it, you are trying to assign a list of columns to encode, but the loop is not necessary there (for both encoders); you can simply assign the list:
def __init__(self, columns):
    self.columns = columns
For the one-hot encoder, I think pd.get_dummies() is more elegant than OneHotEncoder, so the transform function becomes:
def transform(self, X):
    '''
    1. Copy the information from the original df.
    2. Drop the old columns from the new df.
    3. Return the new df with one-hot encoded columns.
    '''
    new_df = X.copy(deep=True)
    new_df.drop(self._cols_one_hot, axis=1, inplace=True)
    return new_df.join(pd.get_dummies(X[self._cols_one_hot]))
As for the label encoder part, it will not work for multiple columns at once, because LabelEncoder does not support multi-column encoding; you have to visit and encode each column individually.
def transform(self, X):
    '''
    1. Copy the information from the original df.
    2. Label-encode the columns.
    3. Drop the old columns.
    4. Return the new df with label-encoded columns.
    '''
    new_df = X.copy(deep=True)
    label_encoded_cols = new_df[self._cols_label_encode].apply(LabelEncoder().fit_transform)
    new_df.drop(self._cols_label_encode, axis=1, inplace=True)
    return new_df.join(label_encoded_cols)
The final solution is:
class onehotencode(BaseEstimator, TransformerMixin):
    def __init__(self, cols_one_hot):
        self._cols_one_hot = cols_one_hot

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        1. Copy the information from the original df.
        2. Drop the old columns from the new df.
        3. Return the new df with one-hot encoded columns.
        '''
        new_df = X.copy(deep=True)
        new_df.drop(self._cols_one_hot, axis=1, inplace=True)
        return new_df.join(pd.get_dummies(X[self._cols_one_hot]))

class labelencode(BaseEstimator, TransformerMixin):
    def __init__(self, cols_label_encode):
        self._cols_label_encode = cols_label_encode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        1. Copy the information from the original df.
        2. Label-encode the columns.
        3. Drop the old columns.
        4. Return the new df with label-encoded columns.
        '''
        new_df = X.copy(deep=True)
        label_encoded_cols = new_df[self._cols_label_encode].apply(LabelEncoder().fit_transform)
        new_df.drop(self._cols_label_encode, axis=1, inplace=True)
        return new_df.join(label_encoded_cols)
Then the pipeline will be called as:
pipeline = Pipeline([('ohe', onehotencode(cols_to_encode)),
                     ('le', labelencode(cols_to_encode_label))])
df_transformed = pipeline.fit_transform(df)
df_transformed.head() will then show the encoded dataframe.

How to simplify my data preprocessing with scikit learn pipelines

I have two dataframes: df1 contains examples of cats and df2 contains examples of dogs.
I have to do some preprocessing on these dataframes, which at the moment I'm doing by calling different functions. I would like to use scikit-learn pipelines instead.
One of these functions is a special encoder function that looks at a column in the df and returns a special value. I rewrote that function as a class, like I saw being used in scikit-learn:
class Encoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.values = []
        super().__init__()

    def fit(self, X, y=None):
        return self

    def encode(self, row):
        result = []
        for base in row:
            result.append(bases[base])
        self.values.append(result)

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        X["seq_new"].apply(self.encode)
        return self.values
So now I would have two lists as a result:
encode = Encoder()
X1 = encode.transform(df1)
X2 = encode.transform(df2)
The next step would be:
features = np.concatenate((X1, X2), axis=0)
The next step is to build the labels:
Y_dog = [[1]] * len(X1)
Y_cat = [[0]] * len(X2)
labels = np.concatenate((Y_dog, Y_cat), axis=0)
After some other manipulations, I'll then do a model_selection.train_test_split() to split the data into train and test sets.
How would I call all these functions in a scikit pipeline? The examples that I found start from where the train/test split has already been done.
The thing about an sklearn.pipeline.Pipeline is that every step needs to implement fit and transform. So, for instance, if you know for a fact that you will ALWAYS need to perform the concatenation step, and you really are dying to put it into a Pipeline (which I wouldn't, but that's just my humble opinion), you need to create a Concatenator class with the appropriate fit and transform methods.
Something like this:
import numpy as np
import pandas as pd
import sklearn.pipeline

class Encoder(object):
    def fit(self, X, *args, **kwargs):
        return self
    def transform(self, X):
        return X * 2

class Concatenator(object):
    def fit(self, X, *args, **kwargs):
        return self
    def transform(self, Xs):
        return np.concatenate(Xs, axis=0)

class MultiEncoder(Encoder):
    def transform(self, Xs):
        return list(map(super().transform, Xs))

pipe = sklearn.pipeline.Pipeline((
    ("encoder", MultiEncoder()),
    ("concatenator", Concatenator()),
))

pipe.fit_transform((
    pd.DataFrame([[1, 2], [3, 4]]),
    pd.DataFrame([[5, 6], [7, 8]]),
))

scikit-learn: How to compose LabelEncoder and OneHotEncoder with a pipeline?

While preprocessing the labels for a machine learning classification task, I need to one-hot encode labels which take string values. It happens that OneHotEncoder from sklearn.preprocessing and to_categorical from keras.utils.np_utils require int inputs. This means that I need to precede the one-hot encoder with a LabelEncoder. I have done it by hand with a custom class:
class LabelOneHotEncoder():
    def __init__(self):
        self.ohe = OneHotEncoder()
        self.le = LabelEncoder()
    def fit_transform(self, x):
        features = self.le.fit_transform(x)
        return self.ohe.fit_transform(features.reshape(-1, 1))
    def transform(self, x):
        return self.ohe.transform(self.le.transform(x).reshape(-1, 1))
    def inverse_transform(self, x):
        return self.le.inverse_transform(self.ohe.inverse_transform(x))
    def inverse_labels(self, x):
        return self.le.inverse_transform(x)
I am confident there must be a way of doing it within the sklearn API using a sklearn.pipeline, but when using:
LabelOneHotEncoder = Pipeline([("le", LabelEncoder), ("ohe", OneHotEncoder)])
I get the error ValueError: bad input shape () from the OneHotEncoder. My guess is that the output of the LabelEncoder needs to be reshaped, by adding a trivial second axis. I am not sure how to add this feature though.
It's strange that they don't play together nicely... I'm surprised. I'd extend the class to return the reshaped data like you suggested.
class ModifiedLabelEncoder(LabelEncoder):
    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)
Then using the pipeline should work.
pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])
pipe.fit_transform(['dog', 'cat', 'dog'])
https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/preprocessing/label.py#L39
From scikit-learn 0.20 on, OneHotEncoder accepts strings, so you don't need a LabelEncoder before it anymore, and you can just use it in a pipeline.
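A minimal sketch, assuming scikit-learn >= 0.20 (the column values here are made up for illustration):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

pipe = Pipeline([("ohe", OneHotEncoder(handle_unknown="ignore"))])
X = np.array([["dog"], ["cat"], ["dog"]])   # string labels, no LabelEncoder needed
print(pipe.fit_transform(X).toarray())
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]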
I have used a customized class to wrap my label encoder function and it returns the whole updated dataset.
class CustomLabelEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        le = LabelEncoder()
        for i in X[cat_cols]:
            X[i] = le.fit_transform(X[i])
        return X

cat_cols = ['Family', 'Education', 'Securities Account', 'CDAccount', 'Online', 'CreditCard']
le_ct = make_column_transformer((CustomLabelEncode(), cat_cols), remainder='passthrough')
pd.DataFrame(le_ct.fit_transform(X))  # This will show you your changes
Final_pipeline = make_pipeline(le_ct)
I have implemented it; you can see it in my GitHub repo: https://github.com/Ayushmina-20/sklearn_pipeline
It is not an answer to the asked question, but for applying only a LabelEncoder to all columns you can use the format below:
df_non_numeric = df.select_dtypes(['object'])
non_numeric_cols = df_non_numeric.columns.values

from sklearn.preprocessing import LabelEncoder

for col in non_numeric_cols:
    df[col] = LabelEncoder().fit_transform(df[col].values)
df.head()

Predict with sklearn-KNN using median (instead of mean)

Sklearn-KNN allows one to set weights (e.g., uniform, distance) used when calculating the mean of the k nearest neighbours.
Instead of predicting with the mean, is it possible to predict with the median (perhaps with a user-defined function)?
There is no built-in parameter to adjust the weighting to use the median rather than the mean (you can see in the source that the mean is hard-coded). But because scikit-learn estimators are just Python classes, you can subclass KNeighborsRegressor and override the predict method to do whatever you want.
Here's a quick example, where I've copied and pasted the original predict() method and modified the relevant piece:
import numpy as np
from sklearn.neighbors.regression import KNeighborsRegressor, check_array, _get_weights

class MedianKNNRegressor(KNeighborsRegressor):
    def predict(self, X):
        X = check_array(X, accept_sparse='csr')
        neigh_dist, neigh_ind = self.kneighbors(X)
        weights = _get_weights(neigh_dist, self.weights)
        _y = self._y
        if _y.ndim == 1:
            _y = _y.reshape((-1, 1))
        ######## Begin modification
        if weights is None:
            y_pred = np.median(_y[neigh_ind], axis=1)
        else:
            # y_pred = weighted_median(_y[neigh_ind], weights, axis=1)
            raise NotImplementedError("weighted median")
        ######### End modification
        if self._y.ndim == 1:
            y_pred = y_pred.ravel()
        return y_pred

X = np.random.rand(100, 1)
y = 20 * X.ravel() + np.random.rand(100)
clf = MedianKNNRegressor().fit(X, y)
print(clf.predict(X[:5]))
# [ 2.38172861 13.3871126   9.6737255   2.77561858 17.07392584]
I've left out the weighted version, because I don't know of a simple way to compute a weighted median with numpy/scipy, but it would be straightforward to add in once that function is available.
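For what it's worth, here is a rough numpy sketch of a (lower) weighted median for a 1-D array; applying it row by row would be one way to fill in the commented-out branch above. This is only an illustration, not a tested drop-in replacement.
import numpy as np

def weighted_median_1d(values, weights):
    """Lower weighted median: sort by value, accumulate normalized weights,
    and return the first value whose cumulative weight reaches one half."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, 0.5)]

print(weighted_median_1d([1.0, 2.0, 10.0], [0.2, 0.5, 0.3]))  # -> 2.0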
