How to scale target values of a Keras autoencoder model using a sklearn pipeline?

I'm using sklearn pipelines to build a Keras autoencoder model and a grid search to find the best hyperparameters. This works fine with a Multilayer Perceptron model for classification; in the autoencoder, however, the target values must be the same as the input values. I am using a StandardScaler instance in the pipeline to scale the input values, which leads to my question: how can I make the StandardScaler instance inside the pipeline work on both the input data and the target data, so that they end up the same?
I'm providing a code snippet as an example.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop, Adam
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
X, y = make_classification (n_features = 50, n_redundant = 0, random_state = 0,
                            scale = 100, n_clusters_per_class = 1)
# Define wrapper
# input_shape has no default, so it must come before the defaulted arguments
def create_model (input_shape, learn_rate = 0.01, metrics = ['mse']):
    model = Sequential ()
    model.add (Dense (units = 64, activation = 'relu',
                      input_shape = (input_shape, )))
    model.add (Dense (32, activation = 'relu'))
    model.add (Dense (8, activation = 'relu'))
    model.add (Dense (32, activation = 'relu'))
    model.add (Dense (input_shape, activation = None))
    model.compile (loss = 'mean_squared_error',
                   optimizer = Adam (lr = learn_rate),
                   metrics = metrics)
    return model
# Create scaler
my_scaler = StandardScaler ()
steps = list ()
steps.append (('scaler', my_scaler))
standard_scaler_transformer = Pipeline (steps)
# Create classifier
clf = KerasRegressor (build_fn = create_model, verbose = 2)
# Assemble pipeline
# How to scale input and output??
clf = Pipeline (steps = [('scaler', my_scaler),
                         ('classifier', clf)],
                verbose = True)
# Run grid search
param_grid = {'classifier__input_shape' : [X.shape [1]],
              'classifier__batch_size' : [50],
              'classifier__learn_rate' : [0.001],
              'classifier__epochs' : [5, 10]}
cv = KFold (n_splits = 5, shuffle = False)
grid = GridSearchCV (estimator = clf, param_grid = param_grid,
                     scoring = 'neg_mean_squared_error', verbose = 1, cv = cv)
grid_result = grid.fit (X, X)
print ('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))

You can use TransformedTargetRegressor (from sklearn.compose) to apply arbitrary transformations to the target values (i.e. y) by providing either a function (the func argument) or a transformer (the transformer argument).
In this case (fitting an autoencoder model), since you want to apply the same StandardScaler to the target values as well, you can use the transformer argument. TransformedTargetRegressor fits a clone of the transformer on y, and since y is X here, it learns the same statistics as the scaler applied to the input. This can be done in one of the following ways:
You can use it as one of the pipeline steps, wrapping the regressor:
from sklearn.compose import TransformedTargetRegressor
scaler = StandardScaler()
regressor = KerasRegressor(...)
pipe = Pipeline(steps=[
    ('scaler', scaler),
    ('ttregressor', TransformedTargetRegressor(regressor, transformer=scaler))
])
# Use `__regressor` to access the regressor hyperparameters
param_grid = {'ttregressor__regressor__hyperparam_name' : ...}
gridcv = GridSearchCV(estimator=pipe, param_grid=param_grid, ...)
gridcv.fit(X, X)
Alternatively, you can wrap it around the GridSearchCV like this:
ttgridcv = TransformedTargetRegressor(GridSearchCV(...), transformer=scaler)
ttgridcv.fit(X, X)
# Use `regressor_` attribute to access the fitted regressor (i.e. `GridSearchCV` instance)
print(ttgridcv.regressor_.best_score_, ttgridcv.regressor_.best_params_)
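As a rough, runnable sketch of the first approach, the snippet below swaps the Keras autoencoder for a plain Ridge regressor so it runs with scikit-learn alone. The Ridge stand-in, the alpha grid and the reduced feature count are assumptions made for illustration, not part of the original answer. It checks that the input scaler and the target scaler end up with the same statistics when fitting on (X, X).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data on a large scale, similar to the question's setup
X, _ = make_classification(n_features=10, n_redundant=0, random_state=0,
                           scale=100, n_clusters_per_class=1)

# Ridge is only a stand-in (assumption) for the KerasRegressor autoencoder
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("ttregressor", TransformedTargetRegressor(regressor=Ridge(),
                                               transformer=StandardScaler())),
])

# Nested parameter name: <pipeline step>__regressor__<hyperparameter>
param_grid = {"ttregressor__regressor__alpha": [0.01, 1.0]}
grid = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error",
                    cv=KFold(n_splits=5))
grid.fit(X, X)  # autoencoder-style: the target is the input itself

best = grid.best_estimator_
input_scaler = best.named_steps["scaler"]
target_scaler = best.named_steps["ttregressor"].transformer_
same_scaling = np.allclose(input_scaler.mean_, target_scaler.mean_)

# predict() applies the inverse transform, so outputs are on the original scale
reconstruction = grid.predict(X)
print(same_scaling, reconstruction.shape)
```

Because the scoring is computed on the inverse-transformed predictions, the reported error is on the original data scale rather than the standardized one.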

Related

how to predict multiple dependent columns from 1 independent column

Is it possible to predict multiple dependent columns from independent columns?
Problem statement: I have to predict 5 factors (cEXT, cNEU, cAGR, cCON, cOPN) on the basis of the STATUS column, so the only input variable is STATUS and the target variables are cEXT, cNEU, cAGR, cCON and cOPN.
In my data, STATUS is the independent column and cEXT, cNEU, cAGR, cCON and cOPN are the dependent columns. How can I predict those?
# independent and dependent variable split
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
Right now I am predicting only one column at a time, so I repeat the same thing 5 times and create 5 models for the 5 target variables.
Code:
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
ct = ColumnTransformer([
    ('step1', TfidfVectorizer(), 'STATUS')
], remainder='drop')
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
from sklearn import metrics
from sklearn.pipeline import Pipeline
# ##########
# RandomForest
# ##########
model = Pipeline([
    ('column_transformers', ct),
    ('model', RandomForestClassifier(criterion = 'gini', n_estimators=100, n_jobs = -1, class_weight = 'balanced', max_features = 'auto')),
])
# creating 5 models, can I create 1 model?
# clone() gives each target its own pipeline; calling model.fit() repeatedly
# would refit the same object and leave all five names pointing at it
from sklearn.base import clone
model_cEXT = clone(model).fit(X_train, y_train['cEXT'])
model_cNEU = clone(model).fit(X_train, y_train['cNEU'])
model_cAGR = clone(model).fit(X_train, y_train['cAGR'])
model_cCON = clone(model).fit(X_train, y_train['cCON'])
model_cOPN = clone(model).fit(X_train, y_train['cOPN'])

You can use multioutput classifier from scikit-learn.
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
clf = MultiOutputClassifier(RandomForestClassifier()).fit(X_train, y_train)
clf.predict(X_test)
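To show how this fits the text-input setting of the question, here is a small sketch that puts MultiOutputClassifier behind a TfidfVectorizer in a single pipeline. The status texts and the two binary target columns are invented for illustration; the real data has five targets and the asker's own ColumnTransformer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up status texts and two made-up binary targets (illustrative only)
statuses = [
    "love parties and meeting friends",
    "worried about everything today",
    "great night out with friends",
    "feeling anxious and tired again",
    "party time with the whole gang",
    "so stressed and worried about work",
    "friends came over for dinner",
    "tired anxious and cannot sleep",
]
targets = np.array([
    [1, 0], [0, 1], [1, 0], [0, 1],
    [1, 0], [0, 1], [1, 0], [0, 1],
])

# One pipeline, one fit call; internally one forest is trained per target column
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=20, random_state=0))),
])
model.fit(statuses, targets)

predictions = model.predict(["party with friends", "worried and tired"])
print(predictions.shape)  # one row per input text, one column per target
```

The design point is that a single fitted object returns all target columns at once, while each column still gets its own underlying classifier.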
Reference:
Official document of MultiOutputClassifier

There is a library scikit-multilearn which is very good for these tasks. There are several ways to do multi-label classification such as PowerSet, ClassifierChain etc. These are very well covered in this library.
Below is a sample of how it will replace your current code.
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)
# Rest of your code
# ==========================
# The new code
from skmultilearn.problem_transform import BinaryRelevance
from scipy.sparse import csr_matrix
classifier = BinaryRelevance(
    classifier = RandomForestClassifier(criterion = 'gini', n_estimators=100, n_jobs = -1, class_weight = 'balanced', max_features = 'auto'),
    require_dense = [False, True]
)
model = Pipeline([
    ('column_transformers', ct),
    ('classifier', classifier),
])
model.fit(X_train, y_train.values)
res = model.predict(X_test)
res = csr_matrix(res)
res.todense()
You can explore other methods here.
In TensorFlow you can do this by using a sigmoid activation on every output unit together with a binary cross-entropy loss, as below:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from sklearn.model_selection import train_test_split
tfidf_calculator = TextVectorization(
    standardize = 'lower_and_strip_punctuation',
    split = 'whitespace',
    max_tokens = 100,
    output_mode ='tf-idf',
    pad_to_max_tokens=False)
# The column is named STATUS in the question's DataFrame
tfidf_calculator.adapt(df['STATUS'].values)
tfids = tfidf_calculator(df['STATUS'])
X = tfids.numpy()
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(100,)),
    tf.keras.layers.Dense(10, activation='relu'),
    # One sigmoid unit per target column
    tf.keras.layers.Dense(5, activation='sigmoid')
])
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())
model.fit(X_train, y_train, epochs=20, batch_size=32)
Note that this TensorFlow model needs a dense matrix as input. There might be a way to use sparse input, but I didn't find one.

VotingClassifier with pipelines as estimators

I want to build an sklearn VotingClassifier ensemble out of multiple different models (Decision Tree, SVC, and a Keras Network). All of them need a different kind of data preprocessing, which is why I made a pipeline for each of them.
# Define pipelines
# DTC pipeline
featuriser = Featuriser()
dtc = DecisionTreeClassifier()
dtc_pipe = Pipeline([('featuriser',featuriser),('dtc',dtc)])
# SVC pipeline
scaler = TimeSeriesScalerMeanVariance(kind='constant')
flattener = Flattener()
svc = SVC(C = 100, gamma = 0.001, kernel='rbf')
svc_pipe = Pipeline([('scaler', scaler),('flattener', flattener), ('svc', svc)])
# Keras pipeline
cnn = KerasClassifier(build_fn=get_model())
cnn_pipe = Pipeline([('scaler',scaler),('cnn',cnn)])
# Make an ensemble
ensemble = VotingClassifier(estimators=[('dtc', dtc_pipe),
                                        ('svc', svc_pipe),
                                        ('cnn', cnn_pipe)],
                            voting='hard')
The Featuriser, TimeSeriesScalerMeanVariance and Flattener classes are custom-made transformers that all implement the fit, transform and fit_transform methods.
When I try to fit the whole ensemble with ensemble.fit(X, y), I get the error message:
ValueError: The estimator list should be a classifier.
I can understand this, as the individual estimators are not specifically classifiers but pipelines. Is there a way to still make it work?

The problem is with the KerasClassifier. It does not provide the _estimator_type attribute, which is checked in _validate_estimator.
Using a pipeline is not the problem. Pipeline provides this information as a property. See here.
Hence, the quick fix is to set _estimator_type = 'classifier' on the KerasClassifier.
A reproducible example:
# Define pipelines
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler, Normalizer
from sklearn.ensemble import VotingClassifier
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.datasets import make_classification
from keras.layers import Dense
from keras.models import Sequential
X, y = make_classification()
# DTC pipeline
featuriser = MinMaxScaler()
dtc = DecisionTreeClassifier()
dtc_pipe = Pipeline([('featuriser', featuriser), ('dtc', dtc)])
# SVC pipeline
scaler = Normalizer()
svc = SVC(C=100, gamma=0.001, kernel='rbf')
svc_pipe = Pipeline(
    [('scaler', scaler), ('svc', svc)])
# Keras pipeline
def get_model():
    # create model
    model = Sequential()
    model.add(Dense(10, input_dim=20, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
cnn = KerasClassifier(build_fn=get_model)
cnn._estimator_type = "classifier"
cnn_pipe = Pipeline([('scaler', scaler), ('cnn', cnn)])
# Make an ensemble
ensemble = VotingClassifier(estimators=[('dtc', dtc_pipe),
                                        ('svc', svc_pipe),
                                        ('cnn', cnn_pipe)],
                            voting='hard')
ensemble.fit(X, y)
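The claim that pipelines themselves are accepted can be checked without Keras. The sketch below uses only scikit-learn estimators (the Keras step is deliberately left out, so this is an illustration rather than the full ensemble) and relies on sklearn.base.is_classifier to confirm that a Pipeline ending in a classifier is recognised as one.

```python
from sklearn.base import is_classifier
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=0)

# Each model keeps its own preprocessing inside its own pipeline
dtc_pipe = Pipeline([("featuriser", MinMaxScaler()),
                     ("dtc", DecisionTreeClassifier(random_state=0))])
svc_pipe = Pipeline([("scaler", Normalizer()),
                     ("svc", SVC(C=100, gamma=0.001, kernel="rbf"))])

# A pipeline reports the estimator type of its final step
pipes_are_classifiers = is_classifier(dtc_pipe) and is_classifier(svc_pipe)

ensemble = VotingClassifier(estimators=[("dtc", dtc_pipe),
                                        ("svc", svc_pipe)],
                            voting="hard")
ensemble.fit(X, y)
predictions = ensemble.predict(X)
print(pipes_are_classifiers, predictions.shape)
```

If is_classifier returns False for one of the estimators, that estimator (not the surrounding pipeline) is the one to inspect.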

kfold cross validation wont terminate, stuck at cross_val_score

I am trying to run k-fold cross-validation, but for some reason it gets stuck and never terminates at this line: accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
I can't understand what the problem is or how to fix it.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:,1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
import keras
from keras.models import Sequential #Required to initialize the ANN
from keras.layers import Dense #Build layers of ANN
from keras.layers import Dropout
# Evaluating the ANN
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
def build_classifier(): # Builds the architecture, or the classifier
    classifier = Sequential()
    classifier.add(Dense(activation = 'relu', input_dim = 11, units = 6, kernel_initializer = 'uniform'))# add layers
    classifier.add(Dense(activation = 'relu', units = 6, kernel_initializer = 'uniform'))# add layers
    classifier.add(Dense(activation = 'sigmoid', units = 1, kernel_initializer = 'uniform'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, nb_epoch = 100)
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
mean = accuracies.mean()
variance = accuracies.std()
Edit
I'm on Windows 10 using Anaconda with Python 3.6.
Dataset : Drive Link for dataset
It works perfectly when I set n_jobs = 1 but not when n_jobs = -1

Since you have set n_jobs = -1, all the CPUs are being utilised, as per the documentation mentioned here. However, utilising all the CPUs does not necessarily reduce the execution time, because:
There is an overhead involved in creating new threads and allocating resources to them.
There might be other bottlenecks, such as the data being too large to be broadcast to all threads at the same time, thread pre-emption over RAM (or other resources), how the data is pushed into each thread, etc.
Multithreading in Python also has various shortcomings, see here and here.
You can check out a similar issue with GridSearchCV and parallelization here in this answer.
Also, as mentioned by #ncfith, there is no current solution for this problem.
References
Why do I sometime get a crash/freeze with n_jobs > 1 under OSX or Linux?
Similar issue with numpy on MacOS
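Since the asker reports that the run completes with n_jobs = 1, the sequential fallback can be sketched as follows. A LogisticRegression on synthetic data stands in for the Keras classifier and the churn dataset; both are assumptions made so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled training data (11 features, as in the question)
X_train, y_train = make_classification(n_samples=200, n_features=11,
                                       random_state=0)

# Stand-in estimator; the question uses a KerasClassifier here
classifier = LogisticRegression(max_iter=1000)

# n_jobs=1 runs the 10 folds one after another in the current process
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train,
                             cv=10, n_jobs=1)
mean = accuracies.mean()
std = accuracies.std()
print(len(accuracies))
```

This trades wall-clock time for reliability: nothing is spawned in parallel, so the hang described in the question has no opportunity to occur.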

sklearn RandomizedSearchCV with Pipelined KerasClassifier

I am performing a hyperparameter tuning task with sklearn on Keras models. I am trying to optimize KerasClassifiers within a Pipeline...
Code follows:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold,RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline
my_seed=7
dataframe = pd.read_csv("z:/sonar.all-data.txt", header=None)
dataset = dataframe.values
# split into input and output variables
X = dataset[:,:60].astype(float)
Y = dataset[:,60]
encoder = LabelEncoder()
Y_encoded=encoder.fit_transform(Y)
myScaler = StandardScaler()
X_scaled = myScaler.fit_transform(X)
def create_keras_model(hidden=60):
    model = Sequential()
    model.add(Dense(units=hidden, input_dim=60, kernel_initializer="normal", activation="relu"))
    model.add(Dense(1, kernel_initializer="normal", activation="sigmoid"))
    #compile model
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
def create_pipeline(hidden=60):
    steps = []
    steps.append(('scaler', StandardScaler()))
    steps.append(('dl', KerasClassifier(build_fn=create_keras_model,hidden=hidden, verbose=0)))
    pipeline = Pipeline(steps)
    return pipeline
my_neurons = [15, 30, 60]
my_epochs= [50, 100, 150]
my_batch_size = [5,10]
my_param_grid = dict(hidden=my_neurons, epochs=my_epochs, batch_size=my_batch_size)
model2Tune = KerasClassifier(build_fn=create_keras_model, verbose=0)
model2Tune2 = create_pipeline()
griglia = RandomizedSearchCV(estimator=model2Tune, param_distributions = my_param_grid, n_iter=8 )
griglia.fit(X_scaled, Y_encoded) #this works
griglia2 = RandomizedSearchCV(estimator=create_pipeline, param_distributions = my_param_grid, n_iter=8 )
griglia2.fit(X, Y_encoded) #this does not
We see that RandomizedSearchCV works with griglia, whilst it does not work with griglia2, returning
"TypeError: estimator should be an estimator implementing 'fit' method, was passed".
Is it possible to amend the code to make it run under a Pipeline object?
Thanks in advance

The estimator parameter expects an estimator object, not a function. Currently you are passing the create_pipeline function itself, rather than the pipeline object it returns. Call it by adding () to solve this:
griglia2 = RandomizedSearchCV(estimator=create_pipeline(), param_distributions = my_param_grid, n_iter=8 )
Now for the second comment, about the invalid parameters error: you need to prefix each parameter with the name you gave the step when creating the pipeline, so that it can be passed to the right step.
Look at the description of Pipeline usage here.
Use this:
my_param_grid = dict(dl__hidden=my_neurons, dl__epochs=my_epochs,
                     dl__batch_size=my_batch_size)
Notice the dl__ prefix (with two underscores). This is useful when you want to tune the parameters of multiple objects inside the pipeline.
For example, let's say that along with the above parameters you also want to tune or specify the parameters of StandardScaler.
Then your parameter grid becomes (each entry must be a list or a distribution, so a single fixed value is wrapped in a list):
my_param_grid = dict(dl__hidden=my_neurons, dl__epochs=my_epochs,
                     dl__batch_size=my_batch_size,
                     scaler__with_mean=[False])
Hope this clears things.
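Both points (pass the pipeline object, and prefix parameters with the step name) can be tried out in a small sketch that needs only scikit-learn. LogisticRegression takes the place of the KerasClassifier in the 'dl' step, and its C parameter replaces hidden, epochs and batch_size; these substitutions are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=120, n_features=10, random_state=7)

def create_pipeline():
    # 'dl' holds a stand-in classifier instead of the KerasClassifier
    return Pipeline([("scaler", StandardScaler()),
                     ("dl", LogisticRegression(max_iter=1000))])

# Call the factory: the search needs the Pipeline object, not the function
pipeline = create_pipeline()

# get_params() lists every tunable name, including the step-prefixed ones
valid_names = sorted(pipeline.get_params().keys())

param_distributions = {
    "dl__C": [0.01, 0.1, 1.0, 10.0],      # parameter of the 'dl' step
    "scaler__with_mean": [True, False],   # parameter of the 'scaler' step
}
search = RandomizedSearchCV(estimator=pipeline,
                            param_distributions=param_distributions,
                            n_iter=4, cv=3, random_state=7)
search.fit(X, y)
print(search.best_params_)
```

Printing valid_names is a quick way to find the exact prefixed name when a search fails with an invalid parameter error.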

Keras : GridSearchCV for Hyperparameter Tuning

I'm currently training a CNN for classifying waves. While the code works perfectly, the GridSearchCV for hyperparameter tuning does not work as intended. I was confused because I used similar code for tuning hyperparameters in MLP and it works like a charm. This is the full code, and by the way, I'm using TF as backend.
import pandas as pd
import numpy as np
#Import training set
training_set = pd.read_csv("training_set.csv", delimiter=";")
X_train = training_set.iloc[:,1:].values
y_train = training_set.iloc[:,0:1].values
#Import test set
test_set = pd.read_csv("test_set_v2.csv", delimiter=";")
X_test = test_set.iloc[:,1:].values
y_test = test_set.iloc[:,0:1].values
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.fit_transform(X_test)
#Convert X into 3D tensor
X_train = np.reshape(X_train,(X_train.shape[0],X_train.shape[1],1))
X_test = np.reshape(X_test,(X_test.shape[0],X_test.shape[1],1))
#Importing the CNN libraries
from keras.models import Sequential
from keras.layers import Conv1D,MaxPooling1D,Flatten
from keras.layers import Dropout,Dense
from keras.layers.normalization import BatchNormalization
#Parameter tuning
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
def build_classifier(optimizer, dropout1, dropout2):
    classifier = Sequential()
    classifier.add(Conv1D(filters=4,kernel_size=4,activation='relu',input_shape=(X_train.shape[1],1)))
    classifier.add(MaxPooling1D(strides=4))
    classifier.add(BatchNormalization())
    classifier.add(Flatten())
    classifier.add(Dropout(0.25))
    classifier.add(Dense(8, activation='relu'))
    classifier.add(Dropout(0.25))
    classifier.add(Dense(1,activation='sigmoid'))
    classifier.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return classifier
classifier = KerasClassifier(build_fn=build_classifier)
parameters = {'batch_size': [25,32],
              'epochs': [5,10],
              'optimizer': ['adam', 'rmsprop'],
              'dropout1' : [0.2,0.25,3],
              'dropout2' : [0.2,0.25,3],
              }
grid_search = GridSearchCV(estimator=classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_
The strange thing is that it runs perfectly for an epoch and then raises the following error.
File "C:\Program Files\Anaconda3\lib\site-packages\keras\wrappers\scikit_learn.py", line 220, in predict
    return self.classes_[classes]
IndexError: index 1 is out of bounds for axis 0 with size 1
Can anyone help me? Any kind of help is greatly appreciated! Thanks a lot guys!

SOLVED
Update via github master branch
