While preprocessing the labels for a machine learning classification task, I need to one hot encode labels that take string values. It happens that OneHotEncoder from sklearn.preprocessing and to_categorical from keras.utils.np_utils require integer inputs. This means that I need to precede the one hot encoder with a LabelEncoder. I have done it by hand with a custom class:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class LabelOneHotEncoder():
    def __init__(self):
        self.ohe = OneHotEncoder()
        self.le = LabelEncoder()

    def fit_transform(self, x):
        features = self.le.fit_transform(x)
        return self.ohe.fit_transform(features.reshape(-1, 1))

    def transform(self, x):
        return self.ohe.transform(self.le.transform(x).reshape(-1, 1))

    def inverse_transform(self, x):
        # ravel() because OneHotEncoder.inverse_transform returns a column array
        return self.le.inverse_transform(self.ohe.inverse_transform(x).ravel())

    def inverse_labels(self, x):
        return self.le.inverse_transform(x)
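For illustration, using it looks roughly like this (the labels are chosen arbitrarily and numpy is assumed to be imported as np):

lohe = LabelOneHotEncoder()
y = np.array(['cat', 'dog', 'dog', 'bird'])
y_ohe = lohe.fit_transform(y)    # sparse matrix of shape (4, 3)
lohe.inverse_labels([0, 1, 2])   # array(['bird', 'cat', 'dog'], ...)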
I am confident there must be a way of doing this within the sklearn API using a sklearn.pipeline, but when using:
LabelOneHotEncoder = Pipeline( [ ("le",LabelEncoder), ("ohe", OneHotEncoder)])
I get the error ValueError: bad input shape () from the OneHotEncoder. My guess is that the output of the LabelEncoder needs to be reshaped, by adding a trivial second axis. I am not sure how to add this feature though.
It's strange that they don't play together nicely... I'm surprised. I'd extend the class to return the reshaped data like you suggested.
class ModifiedLabelEncoder(LabelEncoder):
    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)
Then using the pipeline should work.
pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])
pipe.fit_transform(['dog', 'cat', 'dog'])
https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/preprocessing/label.py#L39
From scikit-learn 0.20, OneHotEncoder accepts strings, so you don't need a LabelEncoder before it anymore. And you can just use it in a pipeline.
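For example (a minimal sketch, assuming scikit-learn >= 0.20; the column and labels here are made up):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({'animal': ['dog', 'cat', 'dog', 'bird']})
y = [1, 0, 1, 0]

# OneHotEncoder works directly on the string column, no LabelEncoder needed
pipe = Pipeline([('ohe', OneHotEncoder(handle_unknown='ignore')),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)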
I have used a custom class to wrap my label encoder function; it returns the whole updated dataset.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

class CustomLabelEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        le = LabelEncoder()
        for i in X[cat_cols]:   # cat_cols is defined at module level below
            X[i] = le.fit_transform(X[i])
        return X

cat_cols = ['Family', 'Education', 'Securities Account', 'CDAccount', 'Online', 'CreditCard']
le_ct = make_column_transformer((CustomLabelEncode(), cat_cols), remainder='passthrough')
pd.DataFrame(le_ct.fit_transform(X))  # this will show you your changes
Final_pipeline = make_pipeline(le_ct)
I have implemented it; you can see it in my GitHub repo:
https://github.com/Ayushmina-20/sklearn_pipeline
This is not exactly what was asked, but if you just want to apply LabelEncoder to all non-numeric columns, you can use the following pattern:
from sklearn.preprocessing import LabelEncoder

df_non_numeric = df.select_dtypes(['object'])
non_numeric_cols = df_non_numeric.columns.values

for col in non_numeric_cols:
    df[col] = LabelEncoder().fit_transform(df[col].values)
df.head()
Related
I am trying to do some hyper-parameter tuning in my pipeline and have the following setup:
model = KerasClassifier(build_fn=create_model, epochs=5)
pipeline = Pipeline(steps=[('Tokenizepadder', TokenizePadding()),
                           ('NN', model)])
I have a variable 'maxlen' in both the Tokenizepadder and my neural network (for the neural network it is called 'max_length'; I was afraid naming them the same would cause errors later in the code). When I try to perform a grid search, I am struggling to make these values correspond. If I grid search over these values separately, they won't match and the input data will not match the neural network.
In short I would like to do something like:
pipeline = Pipeline(steps=[('Tokenizepadder', TokenizePadding()),
                           ('NN', KerasClassifier(build_fn=create_model, epochs=5,
                                                  max_length=pipeline.get_params()['Tokenizepadder__maxlen']))])
So that when I am performing a grid search for the parameter 'Tokenizepadder__maxlen', it will change the value 'NN__max_length' to the same value.
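For illustration, one way to couple the two values by hand is to build the parameter grid as a list of dicts, one per candidate, so both keys always receive the same number (a sketch; the values are made up):

param_grid = [
    {'Tokenizepadder__maxlen': [n], 'NN__max_length': [n]}
    for n in [50, 100, 200]
]
grid = GridSearchCV(pipeline, param_grid, cv=3)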
Maybe you can change your classifier and tokenizer to pass the max_len parameter around, and then grid search only over the tokenizer's max_len parameter.
Not the cleanest way, but it might do.
from sklearn.base import BaseEstimator, TransformerMixin
class TokenizePadding(BaseEstimator, TransformerMixin):
    def __init__(self, max_len, ...):
        self.max_len = max_len
        ...

    def fit(self, X, y=None):
        ...
        return self

    def transform(self, X, y=None):
        data = ...  # your stuff
        return {"array": data, "max_len": self.max_len}


class KerasClassifier(...):
    ...

    def fit(self, data, y):
        self.max_len = data["max_len"]
        self.build_model()
        X = data["array"]
        ...  # your stuff
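With that in place, the grid search only needs to vary the tokenizer's parameter, roughly like this (a sketch; it assumes GridSearchCV is imported, the step keeps the 'Tokenizepadder' name from the question, and X_train/y_train are your training data):

param_grid = {'Tokenizepadder__max_len': [50, 100, 200]}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X_train, y_train)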
I'm trying to implement a sklearn pipeline; my code is as follows. It's the tips dataset: https://www.kaggle.com/jsphyg/tipping. I'm trying to label encode the binary features, one hot encode the day column and scale the total column. Below you can find one of my classes (the other two have almost the same structure, so I won't post them; I get the same error as with this one).
class onehotencode(BaseEstimator, TransformerMixin):
    def __init__(self, column=None):
        for column in cols_to_encode:
            self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        encoder = OneHotEncoder()
        return encoder.fit_transform(X[self.column])


class labelencode(BaseEstimator, TransformerMixin):
    def __init__(self, column=None):
        for column in cols_to_encode_label:
            self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        encoder = LabelEncoder()
        return encoder.fit_transform(X[self.column])


pipeline = Pipeline([('ohe', onehotencode()),
                     ('le', labelencode()),
                     ('scaler', scaler())])

df_transformed = pipeline.fit_transform(df)
When I try to fit the pipeline I get the following error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I tried to change the transform as follows:
def transform(self, X):
    encoder = LabelEncoder()
    return encoder.fit_transform(X[[self.column]])
When I do, I get the following error:
cannot perform reduce with flexible type
Can anyone help me? I really did search for the errors above but couldn't fix them.
Thanks.
The first problem that I observed is the __init__ method in the class:
def __init__(self, column=None):
    for column in cols_to_encode:
        self.column = column
As I understand it, you are trying to assign a list of columns to encode, but the loop is not necessary there (for both encoders); you can simply assign the list:
def __init__(self, columns):
    self.columns = columns
For the one hot encoder, I think pd.get_dummies() is more elegant than OneHotEncoder, so the transform function becomes:
def transform(self, X):
    '''
    1. Copy the information from the original df,
    2. Drop the old columns from the new df,
    3. Return the new df with one hot encoded columns
    '''
    new_df = X.copy(deep=True)
    new_df.drop(self._cols_one_hot, axis=1, inplace=True)
    return new_df.join(pd.get_dummies(X[self._cols_one_hot]))
For the label encoder part, this will not work for multiple columns because LabelEncoder does not support encoding multiple columns at once, so you have to visit and encode each column separately.
def transform(self, X):
    '''
    1. Copy the information from the original df,
    2. Label encode the columns,
    3. Drop the old columns,
    4. Return the new df with label encoded columns
    '''
    new_df = X.copy(deep=True)
    label_encoded_cols = new_df[self._cols_label_encode].apply(LabelEncoder().fit_transform)
    new_df.drop(self._cols_label_encode, axis=1, inplace=True)
    return new_df.join(label_encoded_cols)
Final solution will be:
class onehotencode(BaseEstimator, TransformerMixin):
    def __init__(self, cols_one_hot):
        self._cols_one_hot = cols_one_hot

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        1. Copy the information from the original df,
        2. Drop the old columns from the new df,
        3. Return the new df with one hot encoded columns
        '''
        new_df = X.copy(deep=True)
        new_df.drop(self._cols_one_hot, axis=1, inplace=True)
        return new_df.join(pd.get_dummies(X[self._cols_one_hot]))


class labelencode(BaseEstimator, TransformerMixin):
    def __init__(self, cols_label_encode):
        self._cols_label_encode = cols_label_encode

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        1. Copy the information from the original df,
        2. Label encode the columns,
        3. Drop the old columns,
        4. Return the new df with label encoded columns
        '''
        new_df = X.copy(deep=True)
        label_encoded_cols = new_df[self._cols_label_encode].apply(LabelEncoder().fit_transform)
        new_df.drop(self._cols_label_encode, axis=1, inplace=True)
        return new_df.join(label_encoded_cols)
Then the pipeline will be called as:
pipeline = Pipeline([('ohe', onehotencode(cols_to_encode)),
                     ('le', labelencode(cols_to_encode_label))])

df_transformed = pipeline.fit_transform(df)
df_transformed.head() will then show the one hot encoded and label encoded columns.
Maybe this is just a bug, or maybe I am really stupid. I wrapped (or, better said, a colleague wrapped) a Keras model, along with some Keras transformations that were also wrapped, so we can use the Keras model with the sklearn library.
Now when I use fit on the Pipeline it works fine. It runs and it returns a working model instance. However, when I use GridSearchCV, for some reason it fails to do the transforms (or so it would seem) and it gives me the following error:
InvalidArgumentError (see above for traceback): indices[11,2] = 26048 is not in [0, 10001)
[[Node: embedding_4/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](embedding_4/embeddings/read, embedding_4/Cast)]]
The code looks something like this:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, GridSearchCV

vocab_size = 10001

class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def fit(self, X, y=None):
        print('fitting the text')
        print(self.document_count)
        self.fit_on_texts(X)
        return self

    def transform(self, X, y=None):
        print('transforming the text')
        r = np.array(self.texts_to_sequences(X))
        print(r)
        print(self.document_count)
        return r


class Padder(BaseEstimator, TransformerMixin):
    def __init__(self, maxlen=500):
        self.maxlen = maxlen
        self.max_index = None

    def fit(self, X, y=None):
        # self.max_index = pad_sequences(X, maxlen=self.maxlen).max()
        return self

    def transform(self, X, y=None):
        print('pad the text')
        X = pad_sequences(X, maxlen=self.maxlen, padding='post')
        # X[X > self.max_index] = 0
        print(X)
        return X

maxlen = 15

def makeLstmModel():
    model = Sequential()
    model.add(Embedding(10001, 100, input_length=15))
    model.add(LSTM(35, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(16, activation='sigmoid'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

lstmmodel = KerasClassifier(build_fn=makeLstmModel, epochs=5, batch_size=1000, verbose=42)

pipeline = [
    ('seq', TextsToSequences(num_words=vocab_size)),
    ('pad', Padder(maxlen)),
    ('clf', lstmmodel)
]

textClassifier = Pipeline(pipeline)

# Setup parameters
parameters = {}  # Some params to use in grid search

skf = StratifiedKFold(n_splits=numberOfFolds, shuffle=True, random_state=1)
gscv = GridSearchCV(textClassifier, parameters, cv=skf, iid=False, n_jobs=1, verbose=50)
gscv.fit(x_train, y_train)
Now the above code fails with InvalidArgumentError, but when I run fit with the Pipeline alone, it works.
Is there a difference between fit() in GridSearchCV and Pipeline? Am I really stupid or is this just a bug?
BTW, I am currently forced to use Sklearn 0.19.1.
After hours of thinking and debugging, I came to the following conclusion:
Pipeline.fit() is able to auto fill **kwargs arguments.
GridSearchCV.fit() is not able to auto fill **kwargs arguments.
I tested this on sklearn 0.19.1
My issue was that the bag of words created with the Keras Tokenizer is supposed to be limited by the num_words parameter, so that the vocabulary size matches the input dimension of the LSTM model. My colleague did a bad job there: because num_words was never actually set, the bag was always bigger than the input dimension.
num_words was passed to the Tokenizer as a **kwargs argument.
class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
For some reason GridSearchCV.fit() is not able to fill this in automatically (most likely because GridSearchCV clones every estimator before fitting, and the clone only keeps parameters that are declared explicitly in __init__ and exposed through get_params(); anything hidden in **kwargs is lost). The solution is to use an explicit argument:
class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
    def __init__(self, num_words=8000, **kwargs):
        super().__init__(num_words, **kwargs)
After this change GridSearchCV.fit() works.
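A side effect worth noting: once num_words is an explicit __init__ argument it is reported by get_params(), so it can also be tuned through the pipeline step name (a sketch based on the pipeline above; the values are chosen to stay within the embedding's input dimension):

parameters = {'seq__num_words': [5000, 10001]}
gscv = GridSearchCV(textClassifier, parameters, cv=skf, n_jobs=1, verbose=50)
gscv.fit(x_train, y_train)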
I have 2 dfs: df1 contains examples of cats and df2 contains examples of dogs.
I have to do some preprocessing with these dfs, which at the moment I'm doing by calling different functions. I would like to use scikit-learn pipelines instead.
One of these functions is a special encoder function that looks at a column in the df and returns a special value. I rewrote that function as a class, like the ones I saw being used in scikit-learn:
class Encoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.values = []
        super().__init__()

    def fit(self, X, y=None):
        return self

    def encode(self, row):
        result = []
        for base in row:
            result.append(bases[base])
        self.values.append(result)

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        X["seq_new"].apply(self.encode)
        return self.values
so now I would have 2 lists as a result:
encode = Encoder()
X1 = encode.transform(df1)
X2 = encode.transform(df2)
The next step would be:
features = np.concatenate((X1, X2), axis=0)
The next step is to build the labels:
Y_dog = [[1]] * len(X1)
Y_cat = [[0]] * len(X2)
labels = np.concatenate((Y_dog, Y_cat), axis=0)
After some other manipulations, I'll do a model_selection.train_test_split() to split the data into train and test sets.
How would I call all these functions in a scikit pipeline? The examples that I found start from where the train/test split has already been done.
The thing about an sklearn.pipeline.Pipeline is that every step needs to implement fit and transform. So, for instance, if you know for a fact that you will ALWAYS need to perform the concatenation step, and you really are dying to put it into a Pipeline (which I wouldn't, but that's just my humble opinion), you need to create a Concatenator class with the appropriate fit and transform methods.
Something like this:
import numpy as np
import pandas as pd
import sklearn.pipeline

class Encoder(object):
    def fit(self, X, *args, **kwargs):
        return self

    def transform(self, X):
        return X * 2


class Concatenator(object):
    def fit(self, X, *args, **kwargs):
        return self

    def transform(self, Xs):
        return np.concatenate(Xs, axis=0)


class MultiEncoder(Encoder):
    def transform(self, Xs):
        return list(map(super().transform, Xs))


pipe = sklearn.pipeline.Pipeline([
    ("encoder", MultiEncoder()),
    ("concatenator", Concatenator())
])

pipe.fit_transform((
    pd.DataFrame([[1, 2], [3, 4]]),
    pd.DataFrame([[5, 6], [7, 8]])
))
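Applied back to the original data, the same pipeline could take both frames at once, and the labels can be built to follow the concatenation order (a sketch; df1 and df2 are the cat and dog frames from the question):

features = pipe.fit_transform((df1, df2))

# labels in the same order as the concatenated features: cats first, then dogs
labels = np.concatenate(([[0]] * len(df1), [[1]] * len(df2)), axis=0)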
I have a dataframe in Python with a datetime field called 'datetime'. Using Pipeline and FeatureUnion I am trying to extract day, month, weekday and isBusinessday. In order to extract those features I have written custom code.
I am using the following code to extract day, month, weekday and isBusinessday:
class itemselector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def transform(self, X):
        return X[self.key]

    def fit(self, X, y=None):
        return self


f_df = Pipeline([
    ('union', FeatureUnion([
        ('date', Pipeline([
            ('sitem', itemselector('pickup_datetime')),
            ('sday', Extract_date()),
        ])),
        ('month', Pipeline([
            ('sitem', itemselector('pickup_datetime')),
            ('smonth', Extract_month()),
        ])),
    ])),
])
When I run this code I get a flat list as output. For example:
df = f_df.fit_transform(df_train[:5])
Output:
[14 12 19 6 26 3 6 1 4 3]  # it contains both day and month; this is not the expected output
But I want day and month to be separate features. How can I do that? What went wrong in my code? Can someone help me find it?
UPDATE
To summarise my problem: I am getting output of shape (10,), but I want my output to be (5, 2).
Update 1: as per the request, I have added the necessary code.
class Extract_date(BaseEstimator, TransformerMixin):
    def fit(self, X):
        print('one')
        return self

    def transform(self, X):
        return X.apply(lambda y: y.day)


class Extract_month(BaseEstimator, TransformerMixin):
    def fit(self, X, **atr):
        print('two')
        return self

    def transform(self, X):
        return X.apply(lambda y: y.month)
OK, Extract_month and Extract_date return a Series, which is a 1-d vector, hence the FeatureUnion is not stacking them correctly. For FeatureUnion you need 2-d data with the same number of rows from each internal transformer.
You can use reshape(-1,1) for this.
So change your methods like this:
class Extract_date(BaseEstimator, TransformerMixin):
    ...

    def transform(self, X):
        return X.apply(lambda y: y.day).values.reshape(-1, 1)


class Extract_month(BaseEstimator, TransformerMixin):
    ...

    def transform(self, X):
        return X.apply(lambda y: y.month).values.reshape(-1, 1)
Now the output should be correct. Feel free to ask if there is still any problem.
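For a quick sanity check, a toy frame with five timestamps should now come out as a (5, 2) array (a sketch; the column name matches the pipeline in the question):

import pandas as pd

toy = pd.DataFrame({'pickup_datetime': pd.to_datetime(
    ['2016-03-14', '2016-06-12', '2016-01-19', '2016-04-06', '2016-05-26'])})

print(f_df.fit_transform(toy).shape)  # (5, 2): day and month stacked side by side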