VotingClassifier with pipelines as estimators - python

I want to build an sklearn VotingClassifier ensemble out of multiple different models (Decision Tree, SVC, and a Keras Network). All of them need a different kind of data preprocessing, which is why I made a pipeline for each of them.
# Define pipelines

# DTC pipeline
featuriser = Featuriser()
dtc = DecisionTreeClassifier()
dtc_pipe = Pipeline([('featuriser', featuriser), ('dtc', dtc)])

# SVC pipeline
scaler = TimeSeriesScalerMeanVariance(kind='constant')
flattener = Flattener()
svc = SVC(C=100, gamma=0.001, kernel='rbf')
svc_pipe = Pipeline([('scaler', scaler), ('flattener', flattener), ('svc', svc)])

# Keras pipeline (build_fn takes the function itself, not a call to it)
cnn = KerasClassifier(build_fn=get_model)
cnn_pipe = Pipeline([('scaler', scaler), ('cnn', cnn)])

# Make an ensemble
ensemble = VotingClassifier(estimators=[('dtc', dtc_pipe),
                                        ('svc', svc_pipe),
                                        ('cnn', cnn_pipe)],
                            voting='hard')
The Featuriser, TimeSeriesScalerMeanVariance and Flattener classes are custom transformers that all implement fit, transform and fit_transform methods.
When I try to fit the whole ensemble with ensemble.fit(X, y), I get the error message:
ValueError: The estimator list should be a classifier.
Which I can understand, as the individual estimators are not classifiers but pipelines. Is there a way to still make it work?

The problem is with the KerasClassifier. It does not provide the _estimator_type attribute, which is checked in _validate_estimator.
The problem is not the use of pipelines: Pipeline exposes _estimator_type as a property, delegated from its final step.
Hence, the quick fix is setting _estimator_type = 'classifier' on the KerasClassifier instance.
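You can verify this yourself; a quick check, assuming the pipelines from the question (note that _estimator_type is an sklearn-internal attribute):
print(svc_pipe._estimator_type)               # 'classifier', delegated from the final step
print(getattr(cnn, '_estimator_type', None))  # None for the unpatched KerasClassifier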
A reproducible example:
# Define pipelines
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler, Normalizer
from sklearn.ensemble import VotingClassifier
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.datasets import make_classification
from keras.layers import Dense
from keras.models import Sequential

X, y = make_classification()

# DTC pipeline
featuriser = MinMaxScaler()
dtc = DecisionTreeClassifier()
dtc_pipe = Pipeline([('featuriser', featuriser), ('dtc', dtc)])

# SVC pipeline
scaler = Normalizer()
svc = SVC(C=100, gamma=0.001, kernel='rbf')
svc_pipe = Pipeline([('scaler', scaler), ('svc', svc)])

# Keras pipeline
def get_model():
    # create model
    model = Sequential()
    model.add(Dense(10, input_dim=20, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

cnn = KerasClassifier(build_fn=get_model)
cnn._estimator_type = "classifier"  # the quick fix
cnn_pipe = Pipeline([('scaler', scaler), ('cnn', cnn)])

# Make an ensemble
ensemble = VotingClassifier(estimators=[('dtc', dtc_pipe),
                                        ('svc', svc_pipe),
                                        ('cnn', cnn_pipe)],
                            voting='hard')
ensemble.fit(X, y)
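If the fit succeeds, the ensemble behaves like any other sklearn classifier; a minimal usage sketch:
print(ensemble.predict(X[:5]))  # hard-voting predictions for the first five samples
print(ensemble.score(X, y))     # mean accuracy on the training data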

Related

How to predict multiple dependent columns from 1 independent column

Is it possible to predict multiple dependent columns from independent columns?
Problem Statement: I have to predict 5 factors (cEXT, cNEU, cAGR, cCON, cOPN) on the basis of the STATUS column, so the input variable will be the STATUS column only, and the target variables are (cEXT, cNEU, cAGR, cCON, cOPN).
Here, in the above data, STATUS is the independent column and cEXT, cNEU, cAGR, cCON, cOPN are the dependent columns. How can I predict those?
# independent and dependent variable split
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
Right now I am predicting only one column at a time, so I am repeating the same thing 5 times and creating 5 models for the 5 target variables.
Code:
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
ct = ColumnTransformer([
    ('step1', TfidfVectorizer(), 'STATUS')
], remainder='drop')

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
from sklearn import metrics
from sklearn.pipeline import Pipeline

# ##########
# RandomForest
# ##########
model = Pipeline([
    ('column_transformers', ct),
    ('model', RandomForestClassifier(criterion='gini', n_estimators=100, n_jobs=-1,
                                     class_weight='balanced', max_features='auto')),
])

# creating 5 models, can I create 1 model?
# (note: all five variables below point to the same Pipeline object,
# so each fit overwrites the previous one)
model_cEXT = model.fit(X_train, y_train['cEXT'])
model_cNEU = model.fit(X_train, y_train['cNEU'])
model_cAGR = model.fit(X_train, y_train['cAGR'])
model_cCON = model.fit(X_train, y_train['cCON'])
model_cOPN = model.fit(X_train, y_train['cOPN'])
You can use MultiOutputClassifier from scikit-learn.
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
clf = MultiOutputClassifier(RandomForestClassifier()).fit(X_train, y_train)
clf.predict(X_test)
Reference:
Official document of MultiOutputClassifier
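If you want to keep the TfidfVectorizer preprocessing from your pipeline, here is a sketch that wraps the same random forest in a MultiOutputClassifier (assuming the ct, X_train, y_train and X_test from your question):
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('column_transformers', ct),
    ('model', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, n_jobs=-1))),
])
model.fit(X_train, y_train)           # y_train holds all five target columns
predictions = model.predict(X_test)   # shape (n_samples, 5), one column per target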
There is a library, scikit-multilearn, which is very good for these tasks. There are several ways to do multi-label classification, such as LabelPowerset, ClassifierChain, etc. These are very well covered in this library.
Below is a sample of how it will replace your current code.
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
# Rest of your code
# ==========================
# The new code
from skmultilearn.problem_transform import BinaryRelevance
from scipy.sparse import csr_matrix

classifier = BinaryRelevance(
    classifier=RandomForestClassifier(criterion='gini', n_estimators=100, n_jobs=-1,
                                      class_weight='balanced', max_features='auto'),
    require_dense=[False, True]
)

model = Pipeline([
    ('column_transformers', ct),
    ('classifier', classifier),
])

model.fit(X_train, y_train.values)
res = model.predict(X_test)
res = csr_matrix(res)
res.todense()
You can explore the other methods in the library's documentation.
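For instance, ClassifierChain has the same interface as BinaryRelevance but feeds earlier label predictions in as features for later labels; a sketch of swapping it in (assuming the ct and Pipeline from above):
from skmultilearn.problem_transform import ClassifierChain

classifier = ClassifierChain(
    classifier=RandomForestClassifier(n_estimators=100, n_jobs=-1),
    require_dense=[False, True]
)
model = Pipeline([
    ('column_transformers', ct),
    ('classifier', classifier),
])
model.fit(X_train, y_train.values)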
In TensorFlow you can do this using a sigmoid activation and binary cross-entropy loss on all the output units, as below:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

tfidf_calculator = TextVectorization(
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    max_tokens=100,
    output_mode='tf-idf',
    pad_to_max_tokens=False)

tfidf_calculator.adapt(df['STATUS'].values)
tfids = tfidf_calculator(df['STATUS'])

X = tfids.numpy()
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(100,)),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(5, activation='sigmoid')
])
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())
model.fit(X_train, y_train, epochs=20, batch_size=32)
One thing to note with TensorFlow is that it needs a dense matrix as input. There might be a way to use sparse input, but I didn't find one.
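For example, if your features come out of scikit-learn's TfidfVectorizer as a scipy sparse matrix, you can densify them before handing them to Keras (fine for a small vocabulary, memory-heavy for a large one); a sketch:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(max_features=100)
X_sparse = vec.fit_transform(df['STATUS'])
X = X_sparse.toarray()  # Keras needs a dense array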

Newbie: How to evaluate a model to increase accuracy in classification

How do I increase the accuracy of the model? Some of my models, when run, produce results like the one below:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.6780893042575286
Random Forest Classifier : Accuracy: 0.6780893042575286
There are several ways to approach this:
Look at the data. Is it in the best shape for the algorithm? What about NaNs, covariance and so on? Is it normalized, are the categorical features encoded well? This question is too far-reaching for a forum.
Look at the problem and the different algorithms suitable for it. Maybe:
Logistic Regression
SVM
XGBoost
...
Try hyperparameter tuning with RandomizedSearchCV or GridSearchCV.
This is quite high-level.
In terms of model selection, you can use a function like the one below to find a good model that suits the problem.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.utils import class_weight
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

def multi_model(X_train, y_train, X_test, y_test):
    """Function to determine the best model architecture."""
    dfs = []
    models = [
        ('LogReg', LogisticRegression()),
        ('RF', RandomForestClassifier()),
        ('KNN', KNeighborsClassifier()),
        ('SVM', SVC()),
        ('GNB', GaussianNB()),
        ('XGB', XGBClassifier(eval_metric="error"))
    ]
    results = []
    names = []
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']
    target_names = ['App_Status_1', 'App_Status_2']
    for name, model in models:
        kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=90210)
        cv_results = model_selection.cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring)
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(name)
        print(classification_report(y_test, y_pred, target_names=target_names))
        results.append(cv_results)
        names.append(name)
        this_df = pd.DataFrame(cv_results)
        this_df['model'] = name
        dfs.append(this_df)
    final = pd.concat(dfs, ignore_index=True)
    return final
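A minimal usage sketch, assuming the train/test split from your own code:
final = multi_model(X_train, y_train, X_test, y_test)
print(final.groupby('model')[['test_accuracy', 'test_f1_weighted']].mean())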
After model selection, you can do hyperparameter tuning, which will further increase the model's performance.
If you want to improve the model further, you can apply techniques like data augmentation and also revisit the cleaning phase of your data.
If, after all that, it still doesn't improve, you could try collecting more data or refocusing the problem statement.
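For the hyperparameter tuning mentioned above, a minimal GridSearchCV sketch for the random forest (the grid values are only illustrative, not recommendations):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 30],
    'min_samples_leaf': [1, 5],
}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)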

How to scale target values of a Keras autoencoder model using a sklearn pipeline?

I'm using sklearn pipelines to build a Keras autoencoder model and use grid search to find the best hyperparameters. This works fine if I use a multilayer perceptron model for classification; however, in the autoencoder I need the output values to be the same as the input. In other words, I am using a StandardScaler instance in the pipeline to scale the input values, which leads to my question: how can I make the StandardScaler instance inside the pipeline work on both the input data and the target data, so that they end up the same?
I'm providing a code snippet as an example.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop, Adam
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor

X, y = make_classification(n_features=50, n_redundant=0, random_state=0,
                           scale=100, n_clusters_per_class=1)

# Define wrapper
def create_model(input_shape, learn_rate=0.01, metrics=['mse']):
    model = Sequential()
    model.add(Dense(units=64, activation='relu', input_shape=(input_shape,)))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(input_shape, activation=None))
    model.compile(loss='mean_squared_error',
                  optimizer=Adam(lr=learn_rate),
                  metrics=metrics)
    return model

# Create scaler
my_scaler = StandardScaler()
steps = list()
steps.append(('scaler', my_scaler))
standard_scaler_transformer = Pipeline(steps)

# Create regressor
clf = KerasRegressor(build_fn=create_model, verbose=2)

# Assemble pipeline
# How to scale input and output??
clf = Pipeline(steps=[('scaler', my_scaler),
                      ('classifier', clf)],
               verbose=True)

# Run grid search
param_grid = {'classifier__input_shape': [X.shape[1]],
              'classifier__batch_size': [50],
              'classifier__learn_rate': [0.001],
              'classifier__epochs': [5, 10]}
cv = KFold(n_splits=5, shuffle=False)
grid = GridSearchCV(estimator=clf, param_grid=param_grid,
                    scoring='neg_mean_squared_error', verbose=1, cv=cv)
grid_result = grid.fit(X, X)
print('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
You can use TransformedTargetRegressor to apply arbitrary transformations to the target values (i.e. y) by providing either a function (via the func argument) or a transformer (via the transformer argument).
In this case (fitting an auto-encoder), since you want to apply the same StandardScaler instance to the target values as well, you can use the transformer argument. It can be done in one of the following ways:
You can use it as one of the pipeline steps, wrapping the regressor:
from sklearn.compose import TransformedTargetRegressor

scaler = StandardScaler()
regressor = KerasRegressor(...)

pipe = Pipeline(steps=[
    ('scaler', scaler),
    ('ttregressor', TransformedTargetRegressor(regressor, transformer=scaler))
])

# Use `__regressor` to access the regressor hyperparameters
param_grid = {'ttregressor__regressor__hyperparam_name': ...}
gridcv = GridSearchCV(estimator=pipe, param_grid=param_grid, ...)
gridcv.fit(X, X)
Alternatively, you can wrap it around the GridSearchCV like this:
ttgridcv = TransformedTargetRegressor(GridSearchCV(...), transformer=scaler)
ttgridcv.fit(X, X)
# Use the `regressor_` attribute to access the fitted regressor (i.e. the `GridSearchCV` instance)
print(ttgridcv.regressor_.best_score_, ttgridcv.regressor_.best_params_)
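For completeness, the func/inverse_func variant mentioned at the start looks like this (with log1p/expm1 as an arbitrary example pair; any mutually inverse functions work, and this pair assumes non-negative targets):
import numpy as np
from sklearn.compose import TransformedTargetRegressor

ttr = TransformedTargetRegressor(regressor=regressor,
                                 func=np.log1p,
                                 inverse_func=np.expm1)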

I am not sure why decision tree and random forest are displaying 100% accuracy

I am currently working on a model that reads structured data and determines if someone has a disease. I think the issue is that the data is not being split into training and testing sets, but I don't know how to do that.
I am not sure what to try.
import pandas as pd
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier

heart_data = pd.read_csv('cardio_train.csv')
heart_data.head()
heart_data.shape
heart_data.describe()
heart_data.isnull().sum()

heart_data_columns = heart_data.columns
predictors = heart_data[heart_data_columns[heart_data_columns != 'target']]  # all columns except the target
target = heart_data['target']  # target column

# head() returns the first n rows, useful for quickly checking that the object has the right type
predictors.head()
target.head()

# normalize the data by subtracting the mean and dividing by the standard deviation
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()
n_cols = predictors_norm.shape[1]  # number of predictors

def regression_model():
    # create model
    model = Sequential()
    # inputs
    model.add(Dense(50, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(50, activation='relu'))  # activation function
    model.add(Dense(1))
    # compile model; the loss measures how bad the result is, the optimizer generates the next guess
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

# build the model
model = regression_model()
print(model)

# fit the model
history = model.fit(predictors_norm, target, validation_split=0.3, epochs=10, verbose=2)

# Decision Tree
print("Processing Decision Tree")
dtc = DecisionTreeClassifier()
dtc.fit(predictors_norm, target)
print("Decision Tree Test Accuracy {:.2f}%".format(dtc.score(predictors_norm, target) * 100))

# Support Vector Machine
print("Processing Support Vector Machine")
svm = SVC(random_state=1)
svm.fit(predictors_norm, target)
print("Test Accuracy of SVM Algorithm: {:.2f}%".format(svm.score(predictors_norm, target) * 100))

# Random Forest
print("Processing Random Forest")
rf = RandomForestClassifier(n_estimators=1000, random_state=1)
rf.fit(predictors_norm, target)
print("Random Forest Algorithm Accuracy Score : {:.2f}%".format(rf.score(predictors_norm, target) * 100))
The message I am getting is this:
Decision Tree Test Accuracy 100.00%
However, the support vector machine is getting 73.37%.
You are evaluating your model on the same data it was trained on: you are probably overfitting. To overcome this, you must separate the data into two parts, one for training and one for testing:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2)
Then, fit your model on the training dataset and evaluate it on the test dataset:
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
accuracy = dtc.score(x_test, y_test) * 100
print(f"Decision Tree test accuracy : {accuracy} %.")

Keras : GridSearchCV for Hyperparameter Tuning

I'm currently training a CNN for classifying waves. While the code works perfectly, the GridSearchCV for hyperparameter tuning does not work as intended. I was confused because I used similar code for tuning the hyperparameters of an MLP and it worked like a charm. This is the full code; by the way, I'm using TF as the backend.
import pandas as pd
import numpy as np

# Import training set
training_set = pd.read_csv("training_set.csv", delimiter=";")
X_train = training_set.iloc[:, 1:].values
y_train = training_set.iloc[:, 0:1].values

# Import test set
test_set = pd.read_csv("test_set_v2.csv", delimiter=";")
X_test = test_set.iloc[:, 1:].values
y_test = test_set.iloc[:, 0:1].values

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)  # transform only: reuse the scaling fitted on the training set

# Convert X into 3D tensor
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Importing the CNN libraries
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten
from keras.layers import Dropout, Dense
from keras.layers.normalization import BatchNormalization

# Parameter tuning
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_classifier(optimizer, dropout1, dropout2):
    classifier = Sequential()
    classifier.add(Conv1D(filters=4, kernel_size=4, activation='relu', input_shape=(X_train.shape[1], 1)))
    classifier.add(MaxPooling1D(strides=4))
    classifier.add(BatchNormalization())
    classifier.add(Flatten())
    classifier.add(Dropout(dropout1))  # use the searched parameter instead of a hardcoded rate
    classifier.add(Dense(8, activation='relu'))
    classifier.add(Dropout(dropout2))
    classifier.add(Dense(1, activation='sigmoid'))
    classifier.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return classifier

classifier = KerasClassifier(build_fn=build_classifier)

parameters = {'batch_size': [25, 32],
              'epochs': [5, 10],
              'optimizer': ['adam', 'rmsprop'],
              'dropout1': [0.2, 0.25, 0.3],
              'dropout2': [0.2, 0.25, 0.3],
              }

grid_search = GridSearchCV(estimator=classifier,
                           param_grid=parameters,
                           scoring='accuracy',
                           cv=10)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_
The strange thing is, it was running perfectly for an epoch, then it raised the following error:
File "C:\Program Files\Anaconda3\lib\site-packages\keras\wrappers\scikit_learn.py", line 220, in predict
    return self.classes_[classes]
IndexError: index 1 is out of bounds for axis 0 with size 1
Can anyone help me? Any kind of help is greatly appreciated! Thanks a lot, guys!
SOLVED
Fixed by updating Keras via the GitHub master branch.
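For reference, installing Keras from the master branch typically looks like this (assuming pip and git are available; the exact command may differ for your setup):
pip install git+https://github.com/keras-team/keras.git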
