Estimator base class wrapping model and preprocessing pipeline - python

I'm trying to work out how to easily wrap existing estimators along with a preprocessing pipeline and a target encoder, essentially generalizing the idea behind scikit's TransformedTargetRegressor. I have a possible solution but I'm wondering if I'm missing any repercussions of the design that are not immediately obvious. The basic idea is this:
class BaseModel:
    Model = None

    def __init__(self, feature_encoder=None, target_encoder=None, **params):
        steps = [("features", feature_encoder)] if feature_encoder else []
        steps.append(("model", self.Model(**params)))
        self.pipe_ = Pipeline(steps=steps)
        self.target_encoder_ = target_encoder

    def get_params(self, deep=True):
        """Argument `deep` essentially differentiates the use of the resulting params dict.

        - `deep=True` is used in GridSearch etc. to know which parameters can be set with `set_params`.
        - `deep=False` is used by clone(), where the resulting keys must correspond to __init__ args.
        """
        if deep:
            return self.pipe_.get_params(deep=deep)
        params = {
            "feature_encoder": self.pipe_.named_steps["features"],
            "target_encoder": self.target_encoder_,
        }
        params.update(self.pipe_.named_steps["model"].get_params())
        return params

    def set_params(self, **kwargs):
        self.pipe_.set_params(**kwargs)
        return self

    def prepare_fit(self, X, y=None):
        """Encode target, determine model parameters dynamically depending on data etc."""
        ...

    def fit(self, X, y=None):
        y = self.prepare_fit(X, y)
        self.pipe_.fit(X, y)
        ...
        return self

    def predict(self, X):
        """Predict and decode (inverse transform) target."""
        yp = self.pipe_.predict(X)
        ...
And so wrapping CatBoost, for example, would simply be:

class CatboostClassifier(BaseModel):
    Model = catboost.CatBoostClassifier
The crucial part is getting get_params and set_params right such that the wrapped models play nice with scikit's grid search etc., even though the design doesn't follow the official guidelines of having model attributes match its __init__ args.
It seems to work, in the sense that getting and setting any of the model or preprocessing pipeline's parameters works, as does cloning:
model = CatboostClassifier(verbose=50, iterations=100, feature_encoder=..., target_encoder=...)
params = {
    "model__iterations": 95,
    "features__datetime__encode__components": ["year", "month"],
}
print("Params before:", {k: v for k, v in model.get_params().items() if k in params})
cloned = clone(clone(model).set_params(**params))
print("Params after:", {k: v for k, v in cloned.get_params().items() if k in params})

>> Params before: {'features__datetime__encode__components': ['year'], 'model__iterations': 100}
>> Params after: {'features__datetime__encode__components': ['year', 'month'], 'model__iterations': 95}
And GridSearchCV also seems to work:
params = {
    "model__iterations": [100, 200],
    "model__learning_rate": [0.2, 0.3],
}
search = GridSearchCV(model, param_grid=params, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_estimator_.get_params()["model__iterations"])

>> {'model__iterations': 100, 'model__learning_rate': 0.2}
>> 100
So... the question I have is whether there is any other way this design may go wrong, e.g. pending deprecations or planned changes that may break this use of get_params and set_params. From what I can see, the only trick necessary to support an arbitrary constructor like this (breaking the official developer guidelines) is that get_params(deep=True) should return all settable parameters, while get_params(deep=False) is used by base.clone() only and needs to return any and all parameters necessary to call the constructor and make a (correct) copy.
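For reference, my mental model of what base.clone() does with deep=False is roughly the following (a simplified sketch of the idea, not the actual scikit-learn implementation, which also deep-copies parameter values and validates the result):

def clone_sketch(estimator):
    # Ask for the constructor-level parameters only...
    params = estimator.get_params(deep=False)
    # ...and rebuild the estimator from its class, so every key returned
    # by get_params(deep=False) must be a valid __init__ argument.
    return type(estimator)(**params)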
I know this is a rather long "question", but I'd be grateful to know about any caveats I should be aware of regarding the proposed pattern.

Related

Pipeline GridSearchCV, corresponding parameters in different steps

I am trying to do some hyper-parameter tuning in my pipeline and have the following setup:
model = KerasClassifier(build_fn=create_model, epochs=5)
pipeline = Pipeline(steps=[("Tokenizepadder", TokenizePadding()),
                           ("NN", model)])
Where I have a variable 'maxlen' in both the Tokenizepadder and my Neural Network (for the Neural Network it is called max_length; I was afraid naming them the same would cause errors later in the code). When I try to perform a grid search, I am struggling to have these values correspond. If I perform grid search for these values separately, they won't match and there will be a problem with the input data not matching the neural network.
In short I would like to do something like:
pipeline = Pipeline(steps=[("Tokenizepadder", TokenizePadding()),
                           ("NN", KerasClassifier(build_fn=create_model, epochs=5,
                                                  max_length=pipeline.get_params()["Tokenizepadder__maxlen"]))])
So that when I am performing a grid search for the parameter 'Tokenizepadder__maxlen', it will change the value 'NN__max_length' to the same value.
Maybe you can change your classifier and tokenizer to pass the max_len parameter around, and then grid search only over the tokenizer's max_len parameter.
Not the cleanest way, but it might do.
from sklearn.base import BaseEstimator, TransformerMixin

class TokenizePadding(BaseEstimator, TransformerMixin):
    def __init__(self, max_len, ...):
        self.max_len = max_len
        ...

    def fit(self, X, y=None):
        ...
        return self

    def transform(self, X, y=None):
        data = ...  # your stuff
        return {"array": data, "max_len": self.max_len}


class KerasClassifier(...):
    ...

    def fit(self, data, y):
        self.max_len = data["max_len"]
        self.build_model()
        X = data["array"]
        ...  # your stuff
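For illustration, usage could then look roughly like this (a sketch assuming the classes above are completed along the lines of the question, and that create_model, X_train and y_train already exist); only the tokenizer's max_len is searched, and the classifier picks the value up from the dict it receives in fit():

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[("Tokenizepadder", TokenizePadding(max_len=100)),
                           ("NN", KerasClassifier(build_fn=create_model, epochs=5))])

# Only the tokenizer's max_len is tuned; the classifier follows along via fit().
param_grid = {"Tokenizepadder__max_len": [50, 100, 200]}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
search.fit(X_train, y_train)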

How to simplify my data preprocessing with scikit learn pipelines

I have 2 dfs. df1 are examples of cats and df2 are examples of dogs.
I have to do some preprocessing with these dfs that at the moment I'm doing by calling different functions. I would like to use scikit learn pipelines.
One of these functions is a special encoder function that will look at a column in the df and will return a special value. I rewrote that function in a class like I saw being used in scikit learn:
class Encoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.values = []
        super().__init__()

    def fit(self, X, y=None):
        return self

    def encode(self, row):
        result = []
        for base in row:
            result.append(bases[base])
        self.values.append(result)

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        X["seq_new"].apply(self.encode)
        return self.values
so now I would have 2 lists as a result:
encode = Encoder()
X1 = encode.transform(df1)
X2 = encode.transform(df2)
next step would be:
features = np.concatenate((X1, X2), axis=0)
next step build the labels:
Y_dog = [[1]] * len(X1)
Y_cat = [[0]] * len(X2)
labels = np.concatenate((Y_dog, Y_cat), axis=0)
and some other manipulations and then I'll do a model_selection.train_test_split() to split the data into train and test.
How would I call all these functions in a scikit pipeline? The examples that I found start from where the train/test split has already been done.
The thing about an sklearn.pipeline.Pipeline is that every step needs to implement fit and transform. So, for instance, if you know for a fact that you will ALWAYS need to perform the concatenation step, and you really are dying to put it into a Pipeline (which I wouldn't, but that's just my humble opinion), you need to create a Concatenator class with the appropriate fit and transform methods.
Something like this:
class Encoder(object):
    def fit(self, X, *args, **kwargs):
        return self

    def transform(self, X):
        return X * 2


class Concatenator(object):
    def fit(self, X, *args, **kwargs):
        return self

    def transform(self, Xs):
        return np.concatenate(Xs, axis=0)


class MultiEncoder(Encoder):
    def transform(self, Xs):
        return list(map(super().transform, Xs))


pipe = sklearn.pipeline.Pipeline((
    ("encoder", MultiEncoder()),
    ("concatenator", Concatenator())
))

pipe.fit_transform((
    pd.DataFrame([[1, 2], [3, 4]]),
    pd.DataFrame([[5, 6], [7, 8]])
))

scikit-learn: How to compose LabelEncoder and OneHotEncoder with a pipeline?

While preprocessing the labels for a machine learning classifying task, I need to one hot encode the labels, which take string values. It happens that OneHotEncoder from sklearn.preprocessing or to_categorical from keras.np_utils require int inputs. This means that I need to precede the one hot encoder with a LabelEncoder. I have done it by hand with a custom class:
class LabelOneHotEncoder():
    def __init__(self):
        self.ohe = OneHotEncoder()
        self.le = LabelEncoder()

    def fit_transform(self, x):
        features = self.le.fit_transform(x)
        return self.ohe.fit_transform(features.reshape(-1, 1))

    def transform(self, x):
        return self.ohe.transform(self.le.transform(x).reshape(-1, 1))

    def inverse_transform(self, x):
        return self.le.inverse_transform(self.ohe.inverse_transform(x))

    def inverse_labels(self, x):
        return self.le.inverse_transform(x)
I am confident there must be a way of doing it within the sklearn API using a sklearn.pipeline, but when using:
LabelOneHotEncoder = Pipeline( [ ("le",LabelEncoder), ("ohe", OneHotEncoder)])
I get the error ValueError: bad input shape () from the OneHotEncoder. My guess is that the output of the LabelEncoder needs to be reshaped, by adding a trivial second axis. I am not sure how to add this feature though.
It's strange that they don't play together nicely... I'm surprised. I'd extend the class to return the reshaped data like you suggested.
class ModifiedLabelEncoder(LabelEncoder):
    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)
Then using the pipeline should work.
pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])
pipe.fit_transform(['dog', 'cat', 'dog'])
https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/preprocessing/label.py#L39
From scikit-learn 0.20, OneHotEncoder accepts strings, so you don't need a LabelEncoder before it anymore. And you can just use it in a pipeline.
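For example, a minimal sketch (assuming scikit-learn >= 0.20; the LogisticRegression step is just a placeholder final estimator):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = np.array([["dog"], ["cat"], ["dog"], ["bird"]])
y = [1, 0, 1, 0]

# OneHotEncoder handles string categories directly, so no LabelEncoder is needed.
pipe = Pipeline([("ohe", OneHotEncoder(handle_unknown="ignore")),
                 ("clf", LogisticRegression())])
pipe.fit(X, y)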
I have used a customized class to wrap my label encoder function and it returns the whole updated dataset.
class CustomLabelEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        le = LabelEncoder()
        for i in X[cat_cols]:
            X[i] = le.fit_transform(X[i])
        return X

cat_cols = ['Family', 'Education', 'Securities Account', 'CDAccount', 'Online', 'CreditCard']
le_ct = make_column_transformer((CustomLabelEncode(), cat_cols), remainder='passthrough')
pd.DataFrame(le_ct.fit_transform(X))  # This will show you your changes
Final_pipeline = make_pipeline(le_ct)
I have implemented it; you can see it in my GitHub repo: https://github.com/Ayushmina-20/sklearn_pipeline
This is not for the asked question, but for applying only LabelEncoder to all object columns you can use the format below:
df_non_numeric = df.select_dtypes(['object'])
non_numeric_cols = df_non_numeric.columns.values

from sklearn.preprocessing import LabelEncoder

for col in non_numeric_cols:
    df[col] = LabelEncoder().fit_transform(df[col].values)
df.head()

Pass estimator to custom score function via sklearn.metrics.make_scorer

I'd like to make a custom scoring function involving classification probabilities as follows:
def custom_score(y_true, y_pred_proba):
    error = ...
    return error
my_scorer = make_scorer(custom_score, needs_proba=True)

gs = GridSearchCV(estimator=KNeighborsClassifier(),
                  param_grid=[{'n_neighbors': [6]}],
                  cv=5,
                  scoring=my_scorer)
Is there any way to pass the estimator, as fit by GridSearch with the given data and parameters, to my custom scoring function? Then I could interpret the probabilities using estimator.classes_
For example:
def custom_score(y_true, y_pred_proba, clf):
    class_labels = clf.classes_
    error = ...
    return error
There is an alternative way to make a scorer mentioned in the documentation. Using this method I can do the following:
def my_scorer(clf, X, y_true):
    class_labels = clf.classes_
    y_pred_proba = clf.predict_proba(X)
    error = ...
    return error

gs = GridSearchCV(estimator=KNeighborsClassifier(),
                  param_grid=[{'n_neighbors': [6]}],
                  cv=5,
                  scoring=my_scorer)
This avoids the use of sklearn.metrics.make_scorer.
According to the make_scorer docs, it accepts **kwargs: additional arguments that are passed on to score_func as additional parameters.
So you can just write your score function as:
def custom_score(y_true, y_pred_proba, clf):
    class_labels = clf.classes_
    error = ...
    return error
Then use make_scorer as:
my_scorer = make_scorer(custom_score, needs_proba=True, clf=clf_you_want)
The benefit of this method is you can pass any other param to your score function easily.
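As a toy end-to-end sketch of this **kwargs mechanism (fn_weight is a made-up extra parameter, and the score itself is just a placeholder, not a meaningful metric):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def custom_score(y_true, y_pred_proba, fn_weight=1.0):
    # Placeholder "score": mean predicted probability scaled by the extra kwarg.
    return float(np.mean(y_pred_proba) * fn_weight)

my_scorer = make_scorer(custom_score, needs_proba=True, fn_weight=2.0)

X, y = make_classification(n_samples=100, random_state=0)
gs = GridSearchCV(estimator=KNeighborsClassifier(),
                  param_grid=[{"n_neighbors": [4, 6]}],
                  cv=5,
                  scoring=my_scorer)
gs.fit(X, y)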

How to get Tensorflow seq2seq embedding output

I am attempting to train a sequence to sequence model using tensorflow and have been looking at their example code.
I want to be able to access the vector embeddings created by the encoder as they seem to have some interesting properties.
However, it really isn't clear to me how this can be.
In the vector representations of words example they talk a lot about what these embeddings can be used for and then don't appear to provide a simple way of accessing them, unless I am mistaken.
Any help figuring out how to access them would be greatly appreciated.
As with all Tensorflow operations, most variables are created dynamically. There are different ways to access these variables (and their values). Here, the variable you are interested in is part of the set of trained variables. To access these, we can thus use the tf.trainable_variables() function:
for var in tf.trainable_variables():
    print(var.name)
which will give us, for a GRU seq2seq model, the following list:
embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding:0
embedding_rnn_seq2seq/RNN/GRUCell/Gates/Linear/Matrix:0
embedding_rnn_seq2seq/RNN/GRUCell/Gates/Linear/Bias:0
embedding_rnn_seq2seq/RNN/GRUCell/Candidate/Linear/Matrix:0
embedding_rnn_seq2seq/RNN/GRUCell/Candidate/Linear/Bias:0
embedding_rnn_seq2seq/embedding_rnn_decoder/embedding:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/GRUCell/Gates/Linear/Matrix:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/GRUCell/Gates/Linear/Bias:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/GRUCell/Candidate/Linear/Matrix:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/GRUCell/Candidate/Linear/Bias:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/OutputProjectionWrapper/Linear/Matrix:0
embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/OutputProjectionWrapper/Linear/Bias:0
This tells us that the embedding is called embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding:0, which we can then use to retrieve a pointer to that variable in our earlier iterator:
for var in tf.trainable_variables():
    print(var.name)
    if var.name == 'embedding_rnn_seq2seq/RNN/EmbeddingWrapper/embedding:0':
        embedding_op = var
This we can then pass along with other ops to our session-run:
_, loss_t, summary, embedding = sess.run([train_op, loss, summary_op, embedding_op], feed_dict)
and we have ourselves the (batch-list of) embeddings ...
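As a rough illustration of what can then be done with them (assuming embedding is the numpy array fetched above, with shape [vocab_size, embedding_dim]), a simple cosine-similarity lookup:

import numpy as np

def nearest_tokens(embedding, token_id, k=5):
    # Cosine similarity of one token's vector against the whole vocabulary.
    vec = embedding[token_id]
    sims = np.dot(embedding, vec) / np.maximum(
        np.linalg.norm(embedding, axis=1) * np.linalg.norm(vec), 1e-12)
    # Highest-similarity token ids, skipping the query token itself.
    return np.argsort(-sims)[1:k + 1]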
There is a related post, but it is based on tensorflow-0.6, which is quite out of date. So here is an updated version for tensorflow-0.8, which is also similar to the newest version.
(* marks the lines to modify)
losses = []
outputs = []
*states = []
with ops.op_scope(all_inputs, name, "model_with_buckets"):
    for j, bucket in enumerate(buckets):
        with variable_scope.variable_scope(variable_scope.get_variable_scope(),
                                           reuse=True if j > 0 else None):
            *bucket_outputs, _, bucket_states = seq2seq(encoder_inputs[:bucket[0]],
                                                        decoder_inputs[:bucket[1]])
            outputs.append(bucket_outputs)
            if per_example_loss:
                losses.append(sequence_loss_by_example(
                    outputs[-1], targets[:bucket[1]], weights[:bucket[1]],
                    softmax_loss_function=softmax_loss_function))
            else:
                losses.append(sequence_loss(
                    outputs[-1], targets[:bucket[1]], weights[:bucket[1]],
                    softmax_loss_function=softmax_loss_function))
return outputs, losses, *states
At python/ops/seq2seq, modify embedding_attention_seq2seq():
if isinstance(feed_previous, bool):
    *outputs, states = embedding_attention_decoder(
        decoder_inputs, encoder_state, attention_states, cell,
        num_decoder_symbols, embedding_size, num_heads=num_heads,
        output_size=output_size, output_projection=output_projection,
        feed_previous=feed_previous,
        initial_state_attention=initial_state_attention)
    *return outputs, states, encoder_state

# If feed_previous is a Tensor, we construct 2 graphs and use cond.
def decoder(feed_previous_bool):
    reuse = None if feed_previous_bool else True
    with variable_scope.variable_scope(variable_scope.get_variable_scope(), reuse=reuse):
        outputs, state = embedding_attention_decoder(
            decoder_inputs, encoder_state, attention_states, cell,
            num_decoder_symbols, embedding_size, num_heads=num_heads,
            output_size=output_size, output_projection=output_projection,
            feed_previous=feed_previous_bool,
            update_embedding_for_previous=False,
            initial_state_attention=initial_state_attention)
        return outputs + [state]

outputs_and_state = control_flow_ops.cond(feed_previous, lambda: decoder(True), lambda: decoder(False))
*return outputs_and_state[:-1], outputs_and_state[-1], encoder_state
At model/rnn/translate/seq2seq_model.py, modify __init__():
if forward_only:
    *self.outputs, self.losses, self.states = tf.nn.seq2seq.model_with_buckets(
        self.encoder_inputs, self.decoder_inputs, targets,
        self.target_weights, buckets, lambda x, y: seq2seq_f(x, y, True),
        softmax_loss_function=softmax_loss_function)
    # If we use output projection, we need to project outputs for decoding.
    if output_projection is not None:
        for b in xrange(len(buckets)):
            self.outputs[b] = [
                tf.matmul(output, output_projection[0]) + output_projection[1]
                for output in self.outputs[b]
            ]
else:
    *self.outputs, self.losses, _ = tf.nn.seq2seq.model_with_buckets(
        self.encoder_inputs, self.decoder_inputs, targets,
        self.target_weights, buckets,
        lambda x, y: seq2seq_f(x, y, False),
        softmax_loss_function=softmax_loss_function)
At model/rnn/translate/seq2seq_model.py, modify step():
if not forward_only:
    return outputs[1], outputs[2], None  # Gradient norm, loss, no outputs.
else:
    *return None, outputs[0], outputs[1:], outputs[-1]  # No gradient norm, loss, outputs.
With all these done, we can get the encoded states by calling:
_, _, output_logits, states = model.step(sess, encoder_inputs, decoder_inputs,
                                         target_weights, bucket_id, True)
print(states)
in the translate.py.
