Using sklearn voting ensemble with partial fit - python

Can someone please explain how to use ensembles in sklearn with partial fit?
I don't want to retrain my models.
Alternatively, can we pass pre-trained models for ensembling?
I have seen that the voting classifier, for example, does not support training with partial_fit.

The mlxtend library has an implementation, EnsembleVoteClassifier, which allows you to pass in pre-fitted models. For example, if you have three pre-trained models clf1, clf2 and clf3, the following code would work.
from mlxtend.classifier import EnsembleVoteClassifier
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], weights=[1,1,1], fit_base_estimators=False)
When set to False, the fit_base_estimators argument in EnsembleVoteClassifier ensures that the classifiers are not refit.
In general, when looking for more advanced features that scikit-learn does not provide, mlxtend is a good first point of reference.

Workaround:
VotingClassifier checks whether estimators_ is set to decide whether it is fitted, and uses the estimators in the estimators_ list for prediction.
If you have pre-trained classifiers, you can put them in estimators_ directly, as in the code below.
However, it also uses a LabelEncoder, so it assumes labels are like 0, 1, 2, ... and you also need to set le_ and classes_ (see below).
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
clf_list = [clf1, clf2, clf3]
eclf = VotingClassifier(estimators = [('1' ,clf1), ('2', clf2), ('3', clf3)], voting='soft')
eclf.estimators_ = clf_list
# y holds the labels the classifiers were trained on
eclf.le_ = LabelEncoder().fit(y)
eclf.classes_ = eclf.le_.classes_
# Now it will work without calling fit
eclf.predict(X)

Unfortunately, this is currently not possible with scikit-learn's VotingClassifier.
But you can use http://sebastianraschka.com/Articles/2014_ensemble_classifier.html (on which VotingClassifier is based) to implement your own voting classifier which can take pre-fitted models.
We can also look at the source code here and modify it for our use:
from sklearn.preprocessing import LabelEncoder
import numpy as np
le_ = LabelEncoder()
# When you use partial_fit, the first fit of any classifier requires
# all available labels (output classes);
# you should supply all of those same labels here in y.
le_.fit(y)
# Fill the list below with fitted or partially fitted estimators
clf_list = [clf1, clf2, clf3, ... ]
# Fill weights -> array-like, shape = [n_classifiers] or None
weights = [clf1_wgt, clf2_wgt, ... ]
weights = None
# For hard voting (assumes the classifiers predict integer labels 0, 1, 2, ...):
pred = np.asarray([clf.predict(X) for clf in clf_list]).T
pred = np.apply_along_axis(lambda x:
                           np.argmax(np.bincount(x, weights=weights)),
                           axis=1,
                           arr=pred.astype('int'))
# For soft voting:
pred = np.asarray([clf.predict_proba(X) for clf in clf_list])
pred = np.average(pred, axis=0, weights=weights)
pred = np.argmax(pred, axis=1)
# Finally, reverse transform the labels for correct output
# (pred already holds class indices at this point, so no second argmax):
pred = le_.inverse_transform(pred)
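To tie this back to the partial_fit part of the question, here is a minimal sketch of soft voting over classifiers that were trained only with partial_fit. The synthetic data, the batch size and the choice of GaussianNB and BernoulliNB are my own assumptions for illustration; any estimators that offer partial_fit and predict_proba should slot in the same way.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic two-class data (an assumption made for this sketch).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
order = rng.permutation(len(y))
X, y = X[order], y[order]
classes = np.unique(y)  # all labels must be declared on the first partial_fit call

clf_list = [GaussianNB(), BernoulliNB()]
for start in range(0, len(y), 25):  # feed the data in four batches
    batch = slice(start, start + 25)
    for clf in clf_list:
        clf.partial_fit(X[batch], y[batch], classes=classes)

# Soft voting: average the class probabilities, then pick the most likely class.
proba = np.average([clf.predict_proba(X) for clf in clf_list], axis=0, weights=None)
pred = classes[np.argmax(proba, axis=1)]
accuracy = np.mean(pred == y)
```

Nothing in the voting step touches the training data, so the ensemble never needs to be refit when a base model receives another batch.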

It's not too hard to implement the voting. Here's my implementation:
import numpy as np
class VotingClassifier(object):
    """ Implements a voting classifier for pre-trained classifiers"""
    def __init__(self, estimators):
        self.estimators = estimators
    def predict(self, X):
        # get values
        Y = np.zeros([X.shape[0], len(self.estimators)], dtype=int)
        for i, clf in enumerate(self.estimators):
            Y[:, i] = clf.predict(X)
        # apply voting
        y = np.zeros(X.shape[0])
        for i in range(X.shape[0]):
            y[i] = np.argmax(np.bincount(Y[i,:]))
        return y
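The class above gives every classifier one vote. If you also want per-classifier weights, the same np.bincount call accepts them. A small sketch, where the helper name weighted_hard_vote and the toy vote matrix are made up for illustration, and labels are assumed to be integers 0, 1, 2, ...:

```python
import numpy as np

def weighted_hard_vote(votes, weights=None):
    """Majority vote per row; votes has shape (n_samples, n_classifiers)."""
    votes = np.asarray(votes, dtype=int)
    # bincount sums the weights of the classifiers voting for each label
    return np.array([np.argmax(np.bincount(row, weights=weights)) for row in votes])

# Each row is one sample, each column is one classifier's predicted label.
votes = np.array([[0, 0, 1],
                  [1, 2, 2],
                  [2, 1, 1]])
unweighted = weighted_hard_vote(votes)                   # plain majority
weighted = weighted_hard_vote(votes, weights=[3, 1, 1])  # first classifier counts three times
```

With equal weights the majority wins in each row; with weights [3, 1, 1] the first classifier outvotes the other two combined.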

The mlxtend implementation works, but you still need to call the fit function of the EnsembleVoteClassifier. It seems that fit doesn't really modify any parameters; it only checks the possible label values. In the example below, you have to pass eclf2.fit an array containing all the possible values that appear in the original y (in this case 1 and 2). What you pass for X doesn't matter.
import numpy as np
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
for clf in (clf1, clf2, clf3):
    clf.fit(X, y)
# refit=False keeps the pre-fitted models (the other mlxtend answer here calls this argument fit_base_estimators)
eclf2 = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], voting="soft", refit=False)
eclf2.fit(None, np.array([1, 2]))
print(eclf2.predict(X))

Related

Transformed Target Regressor with Predict Parameters

I am using scikit-learn with Gaussian process regression for a problem, together with the built-in TransformedTargetRegressor.
The issue I face is that GPR allows predictions to be returned with their standard deviations. In this case the estimator actually returns a tuple of two numpy arrays (one for the mean, the other for the std). However, TransformedTargetRegressor only expects a numpy array and therefore breaks when the predict method is called with 'return_std=True'.
I have dropped in a really simple example to demonstrate this. It's meant to be representative of an actual problem, hence the inclusion of a pipeline, albeit with no pre-processing steps. There are also some commented-out lines that demonstrate how the predict method works without the TransformedTargetRegressor.
I would like to hear if there is any way around this short of manually applying the transformer to the predictions myself.
#%% Imports
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, DotProduct, RationalQuadratic, Matern)
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import PowerTransformer
#%% Generate Data
X = np.linspace(start=0, stop=10, num=1_000).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
training_indices = rng.choice(np.arange(y.size), size=6, replace=False)
X_train, y_train = X[training_indices], y[training_indices]
#%% Fit Model
kernel = 1 * RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2))
# Standard Estimator
# estimator = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
# Transformed Estimator
estimator = TransformedTargetRegressor(
    regressor=GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9),
    transformer=PowerTransformer(method='yeo-johnson')
)
pipe = Pipeline(
    steps=[
        ("estimator", estimator)
    ]
)
pipe.fit(X_train, y_train)
#%% Predict
# No parameters - Prediction returns numpy array
# pipe.predict(X)
# Std Parameter - Prediction returns tuple of numpy arrays
mean_prediction, std_prediction = pipe.predict(X, return_std=True)
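One possible manual route, sketched here under my own assumptions rather than as an official feature: after fitting, TransformedTargetRegressor exposes the fitted regressor_ and transformer_, so you can predict in the transformed space and map the mean and a mean ± one std band back yourself. Because the transform is nonlinear, the std cannot be inverse-transformed on its own, and the back-transformed band is generally asymmetric. The smaller data set and kernel settings below are simplifications of the question's example.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.preprocessing import PowerTransformer

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
idx = rng.choice(np.arange(y.size), size=20, replace=False)

model = TransformedTargetRegressor(
    regressor=GaussianProcessRegressor(kernel=1 * RBF(length_scale=1.0), random_state=0),
    transformer=PowerTransformer(method="yeo-johnson"),
)
model.fit(X[idx], y[idx])

# Predict in the transformed target space, where return_std is supported.
mean_t, std_t = model.regressor_.predict(X, return_std=True)

def to_original(values):
    """Map transformed-space values back with the fitted transformer."""
    return model.transformer_.inverse_transform(values.reshape(-1, 1)).ravel()

mean = to_original(mean_t)
lower = to_original(mean_t - std_t)  # band edges, not a std in original units
upper = to_original(mean_t + std_t)
```

Far from the training points the band can fall outside the range where the Yeo-Johnson inverse is defined, so check lower and upper for NaN before plotting.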

MultiOutputClassifier only returns learned data

As the title says, I am testing Python's MultiOutputClassifier to solve a problem that requires determining coordinates (x, y) as output, given 3 inputs. As a prediction it only returns the closest learned value, not the 'extrapolated' one.
My sample code is as follows:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
train_data = np.array([
    [-30,-60,-90,0,0],
    [-50,-50,-50,10,0],
    [-90,-60,-30,20,0],
    [-50,-50,-95,0,10],
    [-60,-30,-60,10,10],
    [-95,-50,-50,20,10],
])
# These I just made up
test_data_x = np.array([
    [-35,-50,-90],
])
x = train_data[:, :3]
y = train_data[:, 3:]
forest = RandomForestClassifier(n_estimators=100, random_state=1)
classifier = MultiOutputClassifier(forest, n_jobs=-1)
classifier.fit(x,y)
print(classifier.predict(test_data_x))
This returns 0,10, but I would expect that for the given inputs the output should be something like 5,5; somewhere between two of the learned values.
I see that there is something I am doing wrong or I misunderstood. Any help with this issue? Is it that the MultiOutputClassifier is not the right thing?
The problem here is that a (random forest) classifier won't interpolate or extrapolate: it treats each target value as a discrete label and can only output labels it has already seen. You probably want to use a regressor.
Replacing "Classifier" with "Regressor" in your code yields (0.8, 5.8) as output, which seems closer to what you expected.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
train_data = np.array([
    [-30,-60,-90,0,0],
    [-50,-50,-50,10,0],
    [-90,-60,-30,20,0],
    [-50,-50,-95,0,10],
    [-60,-30,-60,10,10],
    [-95,-50,-50,20,10],
])
test_data_x = np.array([
    [-35,-50,-90],
])
x = train_data[:, :3]
y = train_data[:, 3:]
forest = RandomForestRegressor(n_estimators=100, random_state=1)
classifier = MultiOutputRegressor(forest, n_jobs=-1)
classifier.fit(x,y)
print(classifier.predict(test_data_x))
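To make the difference concrete, here is a short side-by-side sketch on the same data. Both forest estimators accept a 2-D target directly, so I left out the MultiOutput wrappers here; the variable names cls_pred and reg_pred are mine.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

x = np.array([[-30, -60, -90], [-50, -50, -50], [-90, -60, -30],
              [-50, -50, -95], [-60, -30, -60], [-95, -50, -50]])
y = np.array([[0, 0], [10, 0], [20, 0], [0, 10], [10, 10], [20, 10]])
query = np.array([[-35, -50, -90]])

# The classifier picks one of the labels it was trained on, per output column.
cls_pred = RandomForestClassifier(n_estimators=100, random_state=1).fit(x, y).predict(query)

# The regressor averages training targets across trees, so it can land in between.
reg_pred = RandomForestRegressor(n_estimators=100, random_state=1).fit(x, y).predict(query)
```

Note that the regressor's output still stays inside the range of the training targets, because a forest averages values it has seen; it fills in between learned points rather than going beyond them.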

sklearn class_weight in random forest does the opposite of what I expect

I'm using the RandomForestClassifier of sklearn on data with highly imbalanced classes--lots of 0 and few of 1. I'm interested in the number of 1s in the prediction. Example (credit):
# Load libraries
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn import datasets
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Make class highly imbalanced by removing the first 46 observations
X = X[46:,:]
y = y[46:]
# Create target vector indicating if class 0, otherwise 1
y = np.where((y == 0), 1, 0)
#split into training and testing
trainx = X[::2]
trainy = y[::2]
testx = X[1::2]
testy = y[1::2]
# Create random forest classifier object
clf = RandomForestClassifier()
# Train model
clf.fit(trainx, trainy)
print(clf.predict(testx).sum())
This returns 2, which is fine, except that on my real data the result is a bit lower than the true answer. I want to deal with this by using the class_weight parameter. However, when I do:
clf = RandomForestClassifier(class_weight="balanced")
# Train model
clf.fit(trainx, trainy)
print(clf.predict(testx).sum())
I get a result of 0. Same if I use class_weight={1:10}. If I use class_weight={1:.1} I get 2 again.
I get similar behavior on my real data: the higher the weight I give to class 1, the fewer 1s I get in the prediction.
This is the opposite of the behavior I expect (and the opposite of what the class_weight parameter does in svm). What's going on here? This question suggests sklearn is assigning the class labels by some kind of default, but that seems bizarre. Why wouldn't it use the class labels I gave it?
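As a sanity check on which label gets which weight, the "balanced" heuristic documented for scikit-learn is n_samples / (n_classes * np.bincount(y)). The sketch below reproduces only the label layout of the example above (4 positives, then 100 negatives, every second row used for training) without loading iris; it shows the weights, not why the forest reacts the way it does.

```python
import numpy as np

# Same label layout as the question: 4 positives followed by 100 negatives,
# with every second row used for training.
y = np.array([1] * 4 + [0] * 100)
trainy = y[::2]

counts = np.bincount(trainy)                     # samples per class: [50, 2]
balanced = len(trainy) / (len(counts) * counts)  # documented "balanced" formula
class_weight = dict(enumerate(balanced))         # roughly {0: 0.52, 1: 13.0}
```

So label 1 does receive the larger weight, as intended. With only two positive rows in the training half, though, any single run is a very small-sample result.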

How to set 'update_weights' in OpenCV's MLP implementation?

I'm used to using sklearn, whose documentation I find very easy to follow. However, I now need to learn to use OpenCV - in particular, I need to be able to use an MLP classifier, and to update its weights as new training data comes in.
In sklearn, this can be done using the partial_fit method. According to the OpenCV documentation, there is an UPDATE_WEIGHTS flag that can be set, but I can't figure out how to include it in my code.
Here's a MCVE of what I have so far:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import numpy as np
import cv2
from sklearn.neural_network import MLPClassifier
def softmax(x):
    softmaxes = np.zeros(x.shape)
    for i in range(x.shape[1]):
        softmaxes[:, i] = np.exp(x)[:, i]/np.sum(np.exp(x), axis=1)
    return softmaxes
data = load_breast_cancer()
X = data.data
y = data.target.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1729)
y2 = np.zeros((y_train.shape[0], 2))
y2[:,0] = np.where(y_train==0, 1, 0)
y2[:,1] = np.where(y_train==1, 1, 0)
ann = cv2.ml.ANN_MLP_create()
ann.setLayerSizes(np.array([X.shape[1], y2.shape[1]]))
ann.setActivationFunction(cv2.ml.ANN_MLP_SIGMOID_SYM)
ann.train(np.float32(X_train), cv2.ml.ROW_SAMPLE, np.float32(y2))
mlp = MLPClassifier()
mlp.fit(X_train, y_train)
preds_proba = softmax(ann.predict(np.float32(X_test))[1])
print(roc_auc_score(y_test, preds_proba[:,1]))
print(roc_auc_score(y_test, mlp.predict_proba(X_test)[:,1]))
As the scores of the OpenCV classifier and the sklearn one are comparable, I'm pretty confident it's implemented correctly.
How can I modify this code, so that when a new training sample comes in, I can update the weights based on that sample alone, rather than retraining on the entire train set?
The equivalent in sklearn would be:
mlp.partial_fit(X_new_sample, y_new_sample).
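Figured out the answer. Wrap the new samples in a TrainData object and pass the update-weights flag to train (the 0 is cv2.ml.ROW_SAMPLE and flags=1 is cv2.ml.ANN_MLP_UPDATE_WEIGHTS). The syntax is the following:
ann.train(cv2.ml.TrainData_create(np.float32(X), 0, np.float32(y)), flags=1)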

Getting probability result with scikit-learn [duplicate]

I am using scikit-learn's LinearSVC classifier for text mining. I have the y value as a label 0/1 and the X value as the TfidfVectorizer output for the text document.
I use a pipeline like the one below:
pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', LinearSVC())
])
For a prediction, I would like to get the confidence score or probability, in the range (0, 1), of a data point being classified as 1.
I currently use the decision function:
pipeline.decision_function(test_X)
However, it returns positive and negative values that seem to indicate confidence. I am not too sure what they mean either.
Is there a way to get the values in the range 0-1?
For example here is the output of the decision function for some of the data points
-0.40671879072078421,
-0.40671879072078421,
-0.64549376401063352,
-0.40610652684648957,
-0.40610652684648957,
-0.64549376401063352,
-0.64549376401063352,
-0.5468745098794594,
-0.33976011539714374,
0.36781572474117097,
-0.094943829974515004,
0.37728641897721765,
0.2856211778200019,
0.11775493140003235,
0.19387473663623439,
-0.062620918785563556,
-0.17080866610522819,
0.61791016307670399,
0.33631340372946961,
0.87081276844501176,
1.026991628346146,
0.092097790098391641,
-0.3266704728249083,
0.050368652422013376,
-0.046834129250376291,
You can't.
However, you can use sklearn.svm.SVC with kernel='linear' and probability=True.
It may take longer to run, but you can get probabilities from this classifier with the predict_proba method.
import sklearn.svm
clf = sklearn.svm.SVC(kernel='linear', probability=True)
clf.fit(X, y)
clf.predict_proba(X_test)
If you insist on using the LinearSVC class, you can wrap it in a sklearn.calibration.CalibratedClassifierCV object and fit the calibrated classifier which will give you a probabilistic classifier.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets
#Load iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # Using only two features
y = iris.target #3 classes: 0, 1, 2
linear_svc = LinearSVC() #The base estimator
# This is the calibrated classifier, which can output probabilities
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid', #sigmoid will use Platt's scaling. Refer to documentation for other methods.
                                        cv=3)
calibrated_svc.fit(X, y)
# predict
prediction_data = [[2.3, 5],
                   [4, 7]]
predicted_probs = calibrated_svc.predict_proba(prediction_data) #important to use predict_proba
print(predicted_probs)
Here is the output:
[[ 9.98626760e-01 1.27594869e-03 9.72912751e-05]
[ 9.99578199e-01 1.79053170e-05 4.03895759e-04]]
which shows probabilities for each class for each data point.
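For intuition about what method='sigmoid' does with the decision scores: Platt's scaling passes them through a logistic curve whose slope and intercept are learned from held-out data. The sketch below uses fixed, made-up parameters (slope 1, intercept 0) on a few of the scores from the question, so the results are squashed into (0, 1) but are not calibrated probabilities. The helper name sigmoid_scores is hypothetical.

```python
import numpy as np

def sigmoid_scores(decision_values, slope=1.0, intercept=0.0):
    """Squash signed decision scores into (0, 1) with a logistic curve.

    slope and intercept are placeholders here; calibration would fit them.
    """
    z = slope * np.asarray(decision_values, dtype=float) + intercept
    return 1.0 / (1.0 + np.exp(-z))

# A few decision_function outputs from the question, rounded and sorted.
scores = np.array([-0.64549376, -0.09494383, 0.05036865, 1.02699163])
squashed = sigmoid_scores(scores)
```

The ordering of the scores is preserved, and a score of 0 (a point on the decision boundary) maps to 0.5; negative scores lean toward class 0 and positive scores toward class 1.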
