How to retrain pipeline with different data in Scikit-learn?

How to retrain pipeline with different data in Scikit-learn? - python

I do a machine learning model training with pipelines, K-fold cross validation with Python and sklearn on a subset of my all historical data (omitting a test set), along the following:
pipeline = Pipeline([("combiner", PolynomialFeatures()),
("dimred", PCA()),
("classifier", RandomForestClassifier())])
parameters = [...]
CV = GridSearchCV(pipeline, parameters, cv=5, scoring="f1_weighted", refit=True, n_jobs=-1)
CV.fit(train_X, train_y)
So far, so good. However, at the end, I want to retrain the winning pipeline hyperparameter combination on my full X and y, without any cross validation. How could I have this? Simply applying CV.fit(X, y) again would re-doing the whole alternating process with CV, which is obviously unnecessary. I could also parse CV.get_params() for the best combination hyperparameters and build up the pipeline again accordingly, but this somehow seems clumsy and unprofessional...

The answer to your question is in the GridSearchCV documentation. See the Attributes section: best_estimator_ is where the best model is stored, so you can access it from there after you are done with fitting. You can use it by directly calling `CV.best_estimatory_', you can make a new reference to it or pickle it for later using joblib, ie.:
import joblib
joblib.dump(CV.best_estimator_, 'my_pipeline.pkl')
Later you can load your model for further work:
import joblib
my_pipeline = joblib.load('my_pipeline.pkl')
If you do not need the model, but only its hyperparameters you can access those from the best_params_ attribute, ie.:
CV.best_params_
which is a dictionary the best settings that you can use to construct a new pipeline.

Related

Sklearn - fit, scale and transform

The fit() method in sklearn appears to be serving different purposes in same interface.
When applied to the training set, like so:
model.fit(X_train, y_train)
fit() is used to learn parameters that will later be used on the test set with predict(X_test)
However, there are cases when there is no 'learning' involved with fit(), but only some normalization to transform the data, like so:
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(X_train)
which will simply scale feature values between, say, 0 and 1, to avoid some features with higher variance to have a disproportional influence on the model.
To make things even less intuitive, sometimes the fit() method that scales (and already appears to be transforming) needs to be followed by further transform() method, before being called again with the fit() that actually learns and builds the model, like so:
X_train2 = min_max_scaler.transform(X_train)
X_test2 = min_max_scaler.transform(X_test)
# the model being used
knn = KNeighborsClassifier(n_neighbors=3,metric="euclidean")
# learn parameters
knn.fit(X_train2, y_train)
# predict
y_pred = knn.predict(X_test2)
Could someone please clarify the use, or multiple uses, of fit(), as well as the difference of scaling and transforming the data?

fit() function provides a common interface that is shared among all scikit-learn objects.
This function takes as argument X ( and sometime y array to compute the object's statistics. For example, calling fit on a MinMaxScaler transformer will compute its statistics (data_min_, data_max_, data_range_...
Therefore we should see the fit() function as a method that compute the necessary statistics of an object.
This commons interface is really helpful as it allows to combine transformer and estimators together using a Pipeline. This allows to compute and predict all steps in one go as follows:
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
X, y = make_classification(n_samples=1000)
model = make_pipeline(MinMaxScaler(), NearestNeighbors())
model.fit(X, y)
This offers also the possibility to serialize the whole model into one single object.
Without this composition module, I can agree with you that it is not very practically to work with independent transformer and estimator.

In scikit-learn there are 3 classes that share interface: Estimators, Transformers and Predictors
Estimators have fit() function, which serves always the same purpose. It estimates parameters based on the dataset.
Transformers have transform() function. It returns the transformed dataset. Some Estimators are also Transformers, e.g. MinMaxScaler()
Predictors have predict() function, which returns predictions on new instances, e.g. KNeighborsClassifier()
Both MinMaxScaler() and KNeighborClassifier() contain fit() method, because they share interface of an Estimator.
However, there are cases when there is no 'learning' involved with fit()
There is 'learning' involved. Transformer, MinMaxScaler() has to 'learn' min and max values for each numerical feature.
When you call min_max_scaler.fit(X_train) your scaler estimates values for each numerical column in your train set. min_max_scaler.transform(X_train) scales your train set based on the estimations. min_max_scaler.transform(X_test) scales the test set with the estimations learned for train set. This is important to scale both train and test set with the same estimations.
For further reading, you can check this: https://arxiv.org/abs/1309.0238

Fit a scikit-learn model in parallel?

Is it possible to fit a scikit-learn model in parallel? Something along the lines of
model.fit(X, y, n_jobs=20)

It really depends on the model you are trying to fit. Usually it will have an n_jobs parameter when you initialize the model. See glossary on n_jobs. For example random forest:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_jobs=10)
If it is an ensemble method, it makes sense to parallelize because you can fit models separately (see help page for ensemble methods). LogisticRegression() also has an n_job option but I honestly don't know how much this speeds up the fitting process, if that's your bottle neck. See also this post
Other methods like elastic net, linear regression or SVM, i don't think there's a parallelization option.

Nested cross-validation: How does cross_validate handle GridSearchCV as its input estimator?

The following code combines cross_validate with GridSearchCV to perform a nested cross-validation for an SVC on the iris dataset.
(Modified example of the following documentation page:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py.)
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_validate, KFold
import numpy as np
np.set_printoptions(precision=2)
# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
# Set up possible values of parameters to optimize over
p_grid = {"C": [1, 10],
"gamma": [.01, .1]}
# We will use a Support Vector Classifier with "rbf" kernel
svm = SVC(kernel="rbf")
# Choose techniques for the inner and outer loop of nested cross-validation
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)
# Perform nested cross-validation
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv, iid=False)
clf.fit(X_iris, y_iris)
best_estimator = clf.best_estimator_
cv_dic = cross_validate(clf, X_iris, y_iris, cv=outer_cv, scoring=['accuracy'], return_estimator=False, return_train_score=True)
mean_val_score = cv_dic['test_accuracy'].mean()
print('nested_train_scores: ', cv_dic['train_accuracy'])
print('nested_val_scores: ', cv_dic['test_accuracy'])
print('mean score: {0:.2f}'.format(mean_val_score))
cross_validate splits the data set in each fold into a training and a test set. In each fold, the input estimator is then trained based on the training set associated with the fold. The inputted estimator here is clf, a parameterized GridSearchCV estimator, i.e. an estimator that cross-validates itself again.
I have three questions about the whole thing:
If clf is used as the estimator for cross_validate, does it (in the course of the GridSearchCV cross validation) split the above mentioned training set into a subtraining set and a validation set in order to determine the best hyper parameter combination?
Out of all models tested via GridSearchCV, does cross_validate validate only the model stored in the best_estimator_ attribute?
Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?
To make it clearer how the questions are meant, here is an illustration of how I imagine the double cross validation at the moment.

If clf is used as the estimator for cross_validate, does it split the above mentioned training set into a subtraining set and a validation set in order to determine the best hyper parameter combination?
Yes as you can see here at Line 230 the training set is again split into a subtraining and validation set (Specifically at line 240).
Update Yes, when you will pass the GridSearchCV classifier into cross-validate it will again split the training set into a test and train set. Here is a link describing this in more detail. Your diagram and assumption is correct.
Out of all models tested via GridSearchCV, does cross_validate train & validate only the model stored in the variable best_estimator?
Yes, as you can see from the answers here and here, the GridSearchCV returns the best_estimator in your case(since refit parameter is True by default in your case.) However, this best estimator will has to be trained again
Does cross_validate train a model at all (if so, why?) or is the model stored in best_estimator_ validated directly via the test set?
As per your third and final question, Yes, it trains an estimator and returns it if return_estimator is set to True. See this line. Which makes sense, since how else is it supposed to return the scores without training an estimator in the first place ?
Update
The reason the model is trained again is because the default use case for cross-validate does not assume that you give in the best classfier with the optimum parameters. In this case specifically, you are sending in a classifier from the GridSearchCV but if you send any untrained classifier it is supposed to be trained. What I mean to say here is that, yes, in your case it shouldn't train it again since you are already doing cross-validation using GridSearchCV and using the best estimator. However, there is no way for cross-validate to know this, hence, it assumes that you are sending in an un-optimized or rather untrained estimator, thus it has to train it again and return the scores for the same.

What is exactly sklearn.pipeline.Pipeline?

I can't figure out how the sklearn.pipeline.Pipeline works exactly.
There are a few explanation in the doc. For example what do they mean by:
Pipeline of transforms with a final estimator.
To make my question clearer, what are steps? How do they work?
Edit
Thanks to the answers I can make my question clearer:
When I call pipeline and pass, as steps, two transformers and one estimator, e.g:
pipln = Pipeline([("trsfm1",transformer_1),
("trsfm2",transformer_2),
("estmtr",estimator)])
What happens when I call this?
pipln.fit()
OR
pipln.fit_transform()
I can't figure out how an estimator can be a transformer and how a transformer can be fitted.

Transformer in scikit-learn - some class that have fit and transform method, or fit_transform method.
Predictor - some class that has fit and predict methods, or fit_predict method.
Pipeline is just an abstract notion, it's not some existing ml algorithm. Often in ML tasks you need to perform sequence of different transformations (find set of features, generate new features, select only some good features) of raw dataset before applying final estimator.
Here is a good example of Pipeline usage.
Pipeline gives you a single interface for all 3 steps of transformation and resulting estimator. It encapsulates transformers and predictors inside, and now you can do something like:
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
predicted = clf.fit_predict(tfidfX)
# Now evaluate all steps on test set
vX = vect.fit_transform(Xtest)
tfidfX = tfidf.fit_transform(vX)
predicted = clf.fit_predict(tfidfX)
With just:
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)
With pipelines you can easily perform a grid-search over set of parameters for each step of this meta-estimator. As described in the link above. All steps except last one must be transforms, last step can be transformer or predictor.
Answer to edit:
When you call pipln.fit() - each transformer inside pipeline will be fitted on outputs of previous transformer (First transformer is learned on raw dataset). Last estimator may be transformer or predictor, you can call fit_transform() on pipeline only if your last estimator is transformer (that implements fit_transform, or transform and fit methods separately), you can call fit_predict() or predict() on pipeline only if your last estimator is predictor. So you just can't call fit_transform or transform on pipeline, last step of which is predictor.

I think that M0rkHaV has the right idea. Scikit-learn's pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc). Let's break down the two major components:
Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you'll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC!
Estimators are classes that implement both fit() and predict(). You'll find that many of the classifiers and regression models implement both these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn't necessarily implement predict(), but definitely implements fit()). All this means is that you wouldn't be able to call predict().
As for your edit: let's go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.
bin = LabelBinarizer() #first we initialize
vec = ['cat', 'dog', 'dog', 'dog'] #we have our label list we want binarized
Now, when the binarizer is fitted on some data, it will have a structure called classes_ that contains the unique classes that the transformer 'knows' about. Without calling fit() the binarizer has no idea what the data looks like, so calling transform() wouldn't make any sense. This is true if you print out the list of classes before trying to fit the data.
print bin.classes_
I get the following error when trying this:
AttributeError: 'LabelBinarizer' object has no attribute 'classes_'
But when you fit the binarizer on the vec list:
bin.fit(vec)
and try again
print bin.classes_
I get the following:
['cat' 'dog']
print bin.transform(vec)
And now, after calling transform on the vec object, we get the following:
[[0]
[1]
[1]
[1]]
As for estimators being used as transformers, let us use the DecisionTree classifier as an example of a feature-extractor. Decision Trees are great for a lot of reasons, but for our purposes, what's important is that they have the ability to rank features that the tree found useful for predicting. When you call transform() on a Decision Tree, it will take your input data and find what it thinks are the most important features. So you can think of it transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the Decision Tree found.

ML algorithms typically process tabular data. You may want to do preprocessing and post-processing of this data before and after your ML algorithm. A pipeline is a way to chain those data processing steps.
What are ML pipelines and how do they work?
A pipeline is a series of steps in which data is transformed. It comes from the old "pipe and filter" design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Pipelining Machine Learning as:
Pipe and filters. The pipeline’s steps process data, and they manage their inner state which can be learned from the data.
Composites. Pipelines can be nested: for example a whole pipeline can be treated as a single pipeline step in another pipeline. A pipeline step is not necessarily a pipeline, but a pipeline is itself at least a pipeline step by definition.
Directed Acyclic Graphs (DAG). A pipeline step's output may be sent to many other steps, and then the resulting outputs can be recombined, and so on. Side note: despite pipelines are acyclic, they can process multiple items one by one, and if their state change (e.g.: using the fit_transform method each time), then they can be viewed as recurrently unfolding through time, keeping their states (think like an RNN). That’s an interesting way to see pipelines for doing online learning when putting them in production and training them on more data.
Methods of a Scikit-Learn Pipeline
Pipelines (or steps in the pipeline) must have those two methods:
“fit” to learn on the data and acquire state (e.g.: neural network’s neural weights are such state)
“transform" (or "predict") to actually process the data and generate a prediction.
It's also possible to call this method to chain both:
“fit_transform” to fit and then transform the data, but in one pass, which allows for potential code optimizations when the two methods must be done one after the other directly.
Problems of the sklearn.pipeline.Pipeline class
Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?
Scikit-Learn had its first release in 2007, which was a pre deep learning era. However, it’s one of the most known and adopted machine learning library, and is still growing. On top of all, it uses the Pipe and Filter design pattern as a software architectural style - it’s what makes Scikit-Learn so fabulous, added to the fact it provides algorithms ready for use. However, it has massive issues when it comes to do the following, which we should be able to do in 2020 already:
Automatic Machine Learning (AutoML),
Deep Learning Pipelines,
More complex Machine Learning pipelines.
Solutions that we’ve Found to Those Scikit-Learn's Problems
For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and useable within modern computing projects!
Inability to Reasonably do Automatic Machine Learning (AutoML)
Problem: Defining the Search Space (Hyperparameter Distributions)
Problem: Defining Hyperparameters in the Constructor is Limiting
Problem: Different Train and Test Behavior
Problem: You trained a Pipeline and You Want Feedback on its Learning.
Inability to Reasonably do Deep Learning Pipelines
Problem: Scikit-Learn Hardly Allows for Mini-Batch Gradient Descent (Incremental Fit)
Problem: Initializing the Pipeline and Deallocating Resources
Problem: It is Difficult to Use Other Deep Learning (DL) Libraries in Scikit-Learn
Problem: The Ability to Transform Output Labels
Not ready for Production nor for Complex Pipelines
Problem: Processing 3D, 4D, or ND Data in your Pipeline with Steps Made for Lower-Dimensional Data
Problem: Modify a Pipeline Along the Way, such as for Pre-Training or Fine-Tuning
Problem: Getting Model Attributes from Scikit-Learn Pipeline
Problem: You can't Parallelize nor Save Pipelines Using Steps that Can't be Serialized "as-is" by Joblib
Additional pipeline methods and features offered through Neuraxle
Note: if a step of a pipeline doesn’t need to have one of the fit or transform methods, it could inherit from NonFittableMixin or NonTransformableMixin to be provided a default implementation of one of those methods to do nothing.
As a starter, it is possible for pipelines or their steps to also optionally define those methods:
“setup” which will call the “setup” method on each of its step. For instance, if a step contains a TensorFlow, PyTorch, or Keras neural network, the steps could create their neural graphs and register them to the GPU in the “setup” method before fit. It is discouraged to create the graphs directly in the constructors of the steps for several reasons, such as if the steps are copied before running many times with different hyperparameters within an Automatic Machine Learning algorithm that searches for the best hyperparameters for you.
“teardown”, which is the opposite of the “setup” method: it clears resources.
The following methods are provided by default to allow for managing hyperparameters:
“get_hyperparams” will return you a dictionary of the hyperparameters. If your pipeline contains more pipelines (nested pipelines), then the hyperparameter’ keys are chained with double underscores “__” separators.
“set_hyperparams” will allow you to set new hyperparameters in the same format of when you get them.
“get_hyperparams_space” allows you to get the space of hyperparameter, which will be not empty if you defined one. So, the only difference with “get_hyperparams” here is that you’ll get statistic distributions as values instead of a precise value. For instance, one hyperparameter for the number of layers could be a RandInt(1, 3) which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it.
“set_hyperparams_space” can be used to set a new space using the same hyperparameter distribution classes as in “get_hyperparams_space”.
For more info on our suggested solutions, read the entries in the big list with links above.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import pandas as pd
class TextTransformer(BaseEstimator, TransformerMixin):
"""
Преобразование текстовых признаков
"""
def __init__(self, key):
self.key = key
def fit(self, X, y=None, *parg, **kwarg):
return self
def transform(self, X):
return X[self.key]
class NumberTransformer(BaseEstimator, TransformerMixin):
"""
Преобразование числовых признаков
"""
def __init__(self, key):
self.key = key
def fit(self, X, y=None):
return self
def transform(self, X):
return X[[self.key]]
def fit_predict(model, X_train, X_test, y_train, y_test):
vec_tdidf = TfidfVectorizer(ngram_range=(2,2), analyzer='word', norm='l2')
text = Pipeline([
('transformer', TextTransformer(key='clear_messages')),
('vectorizer', vec_tdidf)
])
word_numeric = Pipeline([
('transformer', NumberTransformer(key='word_count')),
('scalar', StandardScaler())
])
word_class = Pipeline([
('transformer', NumberTransformer(key='preds')),
('scalar', StandardScaler())
])
# Объединение всех признаков
features = FeatureUnion([('Text_Feature', text),
('Num1_Feature', word_numeric),
('Num2_Feature', word_class)
])
# Классификатор
clf = model
# Объединение классификатора и признаков
pipe = Pipeline([('features', features),
('clf',clf)
])
# Обучение модели
pipe_fit=pipe.fit(X_train, y_train)
# Предсказание данных
preds = pipe_fit.predict(X_test)
return preds, pipe_fit

Python - SkLearn Logistic Regression: One-by-one train instance

Here is my question, I have a huge train set so I can't load it in memory and apply this code.
model = LogisticRegression()
model = model.fit(train_set_df, y_label_df)
So, I am looking for a way to train my Sklearn.LogisticRegression model by passing instances one-by-one in order to avoid loading all the train data in memory. Thanks

You are looking for the partial_fit method. LogisticRegression does not support it. You can use MultinomialNB (or any other Naive Bayes) or SGDClassifier instead.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.