Using custom Pipeline for Cross Validation scikit-learn - python

I would like to use GridSearchCV to determine the parameters of a classifier, and using pipelines seems like a good option.
The application will be for image classification using Bag-of-Words features, but the issue is that the logical pipeline differs depending on whether training or test examples are used.
For each training set, KMeans must be run to produce a vocabulary that is then used at test time; no KMeans is fitted on the test data.
I cannot see how it is possible to specify this difference in behavior for a pipeline.

You probably need to derive from the KMeans class and override the following methods to use your vocabulary logic:
fit_transform will only be called on the train data
transform will be called on the test data
That said, class derivation is not always the best option. You can also write your own transformer class that wraps an embedded KMeans model and exposes the fit / fit_transform / transform API that the Pipeline class expects for all stages but the last.
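For example, a minimal sketch of such a wrapper (the class name, the cluster count, and the per-descriptor histogram simplification are illustrative assumptions, not code from the answer):
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class BOWVocabulary(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: learns a visual vocabulary with KMeans on the
    training data only, then quantizes whatever data it is given."""
    def __init__(self, n_words=100):
        self.n_words = n_words

    def fit(self, X, y=None):
        # KMeans runs only here, i.e. only on the training folds.
        self.kmeans_ = KMeans(n_clusters=self.n_words).fit(X)
        return self

    def transform(self, X):
        # At test time the already-learned vocabulary is reused.
        words = self.kmeans_.predict(X)
        # Simplified one-hot encoding per row; a real BoW pipeline would
        # pool the descriptors of each image into one histogram.
        return np.eye(self.n_words)[words]
Because it derives from BaseEstimator, such a step can also be tuned with GridSearchCV (e.g. over n_words) when placed inside a Pipeline.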

Including a Predictor in a Pipeline with Scikit-Learn

Actually, my question is more like: "why does this code work properly?"
I was working out a problem from a textbook. Specifically, the problem was to build a Pipeline that had a Data Preparation phase (remove NA values, perform Feature Scaling etc.) and then a Prediction phase, which involves a Predictor trained on the transformed dataset and returning its predictions.
Here, we used a Support Vector Regressor (sklearn.svm.SVR).
I tried some code of mine, but it didn't work. So I looked up the actual solution provided by the author of the textbook:
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', data_prep),
    ('svm_reg', SVR(kernel='rbf', C=30000, gamma='scale'))
])
prepare_select_and_predict_pipeline.fit(x_train, y_train)
some_data = x_train.iloc[:4]
print("Predictions for a subset of Training Set:",
      prepare_select_and_predict_pipeline.predict(some_data))
I tried this code, and it does work as expected.
How can it work properly? My main objections are:
We have only fit the dataset, but where are we actually transforming it? We are not calling a transform() function anywhere...
Also, how can we use the predict() function with this pipeline? SVR might be a part of this pipeline, but so are the other transformers, and they don't have a predict() function.
Thanks in advance for your answers!
When you call fit on the Pipeline, scikit-learn runs fit_transform on every preprocessing step under the hood and fit on the last step (the classifier or regressor). When you call predict on the Pipeline, scikit-learn runs transform on the preprocessing steps and predict on the last step.
Now, the model is not just the last step but the whole chain of steps that takes data in and produces results; the Pipeline itself is the model. If you wrap a Pipeline, with its preprocessing and final steps (regressor or classifier), inside GridSearchCV, then GridSearchCV is the model.
See Pipeline Documentation
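To make that concrete, here is a rough sketch of what the two pipeline calls from the question expand to (data_prep, x_train, y_train and some_data come from the question; scikit-learn's actual internals are more involved):
from sklearn.svm import SVR

# Roughly what prepare_select_and_predict_pipeline.fit(x_train, y_train) does:
x_prepared = data_prep.fit_transform(x_train)        # fit_transform on the preprocessing step
svm_reg = SVR(kernel='rbf', C=30000, gamma='scale')
svm_reg.fit(x_prepared, y_train)                      # fit on the last step

# Roughly what prepare_select_and_predict_pipeline.predict(some_data) does:
some_data_prepared = data_prep.transform(some_data)   # transform only, no re-fitting
predictions = svm_reg.predict(some_data_prepared)     # predict on the last step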

Scikit-learn multithreading

Do you know if models from scikit-learn automatically use multithreading, or just sequential instructions?
Thanks
No. All scikit-learn estimators will by default work on a single thread only.
But then again, it all depends on the algorithm and the problem. If the algorithm inherently needs to consume the data sequentially, we cannot do anything about it. If the problem is multi-class or multi-label and the algorithm works on a one-vs-rest basis, then yes, it can use multi-threading.
Look for a parameter n_jobs in the utility or estimator you want to use, and set it to -1 to use all available cores.
For example:
LogisticRegression, on a binary problem, will only train a single model, which consumes the data sequentially, so n_jobs has no effect there. But it handles multi-class problems as one-vs-rest, so it has to train one estimator per class on the same data; in that case you can use n_jobs=-1.
DecisionTreeClassifier is inherently multi-class capable and does not need to train multiple models, so we don't have that parameter there.
Ensemble methods like RandomForestClassifier will train multiple estimators (irrespective of problem type) which individually work on some part of the data, so here again we can make use of n_jobs.
Cross-validation utilities like cross_val_score or GridSearchCV fit the different folds or parameter combinations independently of one another, so here too we can use the multi-threading capabilities.
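A short sketch of where n_jobs typically shows up (the model choice and parameter values are placeholders, not from the original answer):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Ensemble: the individual trees can be fitted in parallel.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# Model selection: each fold / parameter combination is fitted in parallel.
grid = GridSearchCV(rf, param_grid={'max_depth': [3, 5, None]}, cv=5, n_jobs=-1)
# grid.fit(X, y)  # X, y are your training data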

xgboost.train versus XGBClassifier

I am using python to fit an xgboost model incrementally (chunk by chunk). I came across a solution that uses xgboost.train but I do not know what to do with the Booster object that it returns. For instance, the XGBClassifier has options like fit, predict, predict_proba etc.
Here is what happens inside the for loop that I am reading in the data little by little:
dtrain = xgb.DMatrix(X_train, label=y)
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
modelXG = xgb.train(param, dtrain, xgb_model='xgbmodel')
modelXG.save_model("xgbmodel")
XGBClassifier is a scikit-learn compatible class which can be used in conjunction with other scikit-learn utilities.
Other than that, it's just a wrapper over xgb.train, in which you don't need to supply advanced objects like Booster etc.
Just pass your data to fit(), predict(), etc. and internally it will be converted to the appropriate objects automatically.
I'm not entirely sure what your question was. xgb.XGBClassifier.fit() calls xgb.train() under the hood, so it is a matter of matching up the arguments of the relevant functions.
If you are interested in how to implement the learning that you have in mind, then you can do:
clf = xgb.XGBClassifier(**params)
clf.fit(X, y, xgb_model=your_model)
See the documentation here. On each iteration you will have to save the booster using something like clf.get_booster().save_model(xxx).
P.S. I hope you do the learning in mini-batches, i.e. chunks, and not literally line by line (example by example), as that would cause a performance drop due to writing/reading the model each time.
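Putting the pieces together, a rough sketch of chunk-by-chunk training with the scikit-learn wrapper might look like this (the chunk loop and the parameter values are illustrative assumptions, not code from the answer):
import xgboost as xgb

params = {'max_depth': 2, 'learning_rate': 1.0, 'objective': 'binary:logistic'}
booster = None  # no previous model on the first chunk

for X_chunk, y_chunk in chunks:  # 'chunks' is your own chunk iterator
    clf = xgb.XGBClassifier(**params)
    # Continue training from the previously learned booster, if any.
    clf.fit(X_chunk, y_chunk, xgb_model=booster)
    booster = clf.get_booster()

# The final wrapper still supports the usual scikit-learn API:
# clf.predict(X_test), clf.predict_proba(X_test), etc.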

Computing TF-IDF on the whole dataset or only on training data?

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author, while pre-processing data, uses scikit-learn's fit_transform function to get the tf-idf features of the text for training. The author gives all the text data to the function before separating it into train and test. Is that the correct approach, or must we separate the data first and then perform fit_transform on train and transform on test?
According to the documentation of scikit-learn, fit() is used in order to
Learn vocabulary and idf from training set.
On the other hand, fit_transform() is used in order to
Learn vocabulary and idf, return term-document matrix.
while transform()
Transforms documents to document-term matrix.
On the training set you need to apply both fit() and transform() (or just fit_transform(), which essentially joins both operations); however, on the testing set you only need to transform() the testing instances (i.e. the documents).
Remember that training sets are used for learning purposes (learning is achieved through fit()) while testing set is used in order to evaluate whether the trained model can generalise well to new unseen data points.
For more details you can refer to the article fit() vs transform() vs fit_transform()
The author gives all text data to the function before separating it into train and test. Is that the correct approach, or must we separate the data first and then perform tf-idf fit_transform on train and transform on test?
I would consider this as already leaking some information about the test set into the training set.
I tend to always follow the rule that, before any pre-processing, the first thing to do is to separate the data and create a hold-out set.
As we are talking about text data, we have to make sure that the model is trained only on the vocabulary of the training set: when we deploy the model in real life, it will encounter words it has never seen before, so the validation on the test set has to be done keeping that in mind.
We have to make sure that the new words in the test set are not a part of the vocabulary of the model.
Hence we have to use fit_transform on the training data and transform on the test data.
If you think about doing cross validation, then you can use this logic across all the folds.
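A minimal sketch of the split-first workflow described above (docs and labels stand in for your own text data and targets):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# docs: list of raw text documents, labels: their targets
docs_train, docs_test, y_train, y_test = train_test_split(docs, labels, test_size=0.2)

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(docs_train)  # learn vocabulary and idf on the training set only
X_test = tfidf.transform(docs_test)        # reuse the learned vocabulary; unseen words are ignored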

What is exactly sklearn.pipeline.Pipeline?

I can't figure out how the sklearn.pipeline.Pipeline works exactly.
There are a few explanations in the docs. For example, what do they mean by:
Pipeline of transforms with a final estimator.
To make my question clearer, what are steps? How do they work?
Edit
Thanks to the answers I can make my question clearer:
When I call pipeline and pass, as steps, two transformers and one estimator, e.g:
pipln = Pipeline([("trsfm1",transformer_1),
("trsfm2",transformer_2),
("estmtr",estimator)])
What happens when I call this?
pipln.fit()
OR
pipln.fit_transform()
I can't figure out how an estimator can be a transformer and how a transformer can be fitted.
A transformer in scikit-learn is a class that has fit and transform methods, or a fit_transform method.
A predictor is a class that has fit and predict methods, or a fit_predict method.
A Pipeline is just an abstract notion, not some existing ML algorithm. Often in ML tasks you need to perform a sequence of different transformations of the raw dataset (find a set of features, generate new features, select only some good features) before applying a final estimator.
Here is a good example of Pipeline usage.
Pipeline gives you a single interface for all 3 steps of transformation and resulting estimator. It encapsulates transformers and predictors inside, and now you can do something like:
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()

vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
clf.fit(tfidfX, ytrain)
predicted = clf.predict(tfidfX)
# Now evaluate all steps on test set
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)
With just:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)
With pipelines you can easily perform a grid-search over a set of parameters for each step of this meta-estimator, as described in the link above. All steps except the last one must be transformers; the last step can be a transformer or a predictor.
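For instance, a sketch of a grid-search over both a transformer's and the classifier's parameters in the pipeline above (the particular parameter values are just examples):
from sklearn.model_selection import GridSearchCV

param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],  # parameter of the 'vect' step
    'clf__alpha': [1e-4, 1e-3],              # parameter of the 'clf' step
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(Xtrain, ytrain)
print(search.best_params_)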
Answer to edit:
When you call pipln.fit(), each transformer inside the pipeline is fitted on the outputs of the previous transformer (the first transformer is fitted on the raw dataset). The last estimator may be a transformer or a predictor: you can call fit_transform() on the pipeline only if your last estimator is a transformer (one that implements fit_transform, or transform and fit separately), and you can call fit_predict() or predict() on the pipeline only if your last estimator is a predictor. So you simply can't call fit_transform or transform on a pipeline whose last step is a predictor.
I think that M0rkHaV has the right idea. Scikit-learn's pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc). Let's break down the two major components:
Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you'll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC!
Estimators are classes that implement both fit() and predict(). You'll find that many of the classifiers and regression models implement both these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn't necessarily implement predict(), but definitely implements fit()). All this means is that you wouldn't be able to call predict().
As for your edit: let's go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.
bin = LabelBinarizer() #first we initialize
vec = ['cat', 'dog', 'dog', 'dog'] #we have our label list we want binarized
Now, when the binarizer is fitted on some data, it will have an attribute called classes_ that contains the unique classes that the transformer 'knows' about. Without calling fit() the binarizer has no idea what the data looks like, so calling transform() wouldn't make any sense. You can see this if you try to print out the list of classes before fitting the data:
print(bin.classes_)
I get the following error when trying this:
AttributeError: 'LabelBinarizer' object has no attribute 'classes_'
But when you fit the binarizer on the vec list:
bin.fit(vec)
and try again
print(bin.classes_)
I get the following:
['cat' 'dog']
And now, calling transform on the vec object:
print(bin.transform(vec))
we get the following:
[[0]
[1]
[1]
[1]]
As for estimators being used as transformers, let us use the DecisionTree classifier as an example of a feature-extractor. Decision Trees are great for a lot of reasons, but for our purposes, what's important is that they have the ability to rank features that the tree found useful for predicting. When you call transform() on a Decision Tree, it will take your input data and find what it thinks are the most important features. So you can think of it transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the Decision Tree found.
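Note that in recent scikit-learn versions the transform() method was removed from the tree estimators themselves; the same idea is now usually expressed by wrapping the tree in SelectFromModel. A rough sketch of that substitution (not the original answer's code):
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel

# X, y: your feature matrix and labels.
# Keep only the features the fitted tree considers important.
selector = SelectFromModel(DecisionTreeClassifier())
X_reduced = selector.fit_transform(X, y)  # (n rows, m columns) -> (n rows, k columns)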
ML algorithms typically process tabular data. You may want to do preprocessing and post-processing of this data before and after your ML algorithm. A pipeline is a way to chain those data processing steps.
What are ML pipelines and how do they work?
A pipeline is a series of steps in which data is transformed. It comes from the old "pipe and filter" design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Pipelining Machine Learning as:
Pipe and filters. The pipeline’s steps process data, and they manage their inner state which can be learned from the data.
Composites. Pipelines can be nested: for example a whole pipeline can be treated as a single pipeline step in another pipeline. A pipeline step is not necessarily a pipeline, but a pipeline is itself at least a pipeline step by definition.
Directed Acyclic Graphs (DAG). A pipeline step's output may be sent to many other steps, and then the resulting outputs can be recombined, and so on. Side note: although pipelines are acyclic, they can process multiple items one by one, and if their state changes (e.g. using the fit_transform method each time), then they can be viewed as recurrently unfolding through time, keeping their state (think of an RNN). That's an interesting way to see pipelines for doing online learning when putting them in production and training them on more data.
Methods of a Scikit-Learn Pipeline
Pipelines (or steps in the pipeline) must have these two methods:
“fit” to learn on the data and acquire state (e.g.: neural network’s neural weights are such state)
“transform" (or "predict") to actually process the data and generate a prediction.
It's also possible to call this method to chain both:
“fit_transform” to fit and then transform the data, but in one pass, which allows for potential code optimizations when the two methods must be done one after the other directly.
Problems of the sklearn.pipeline.Pipeline class
Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?
Scikit-Learn had its first release in 2007, in the pre-deep-learning era. However, it is one of the best-known and most widely adopted machine learning libraries, and it is still growing. On top of all that, it uses the Pipe and Filter design pattern as a software architectural style, which is what makes Scikit-Learn so fabulous, added to the fact that it provides algorithms ready for use. However, it has massive issues when it comes to doing the following, which we should already be able to do in 2020:
Automatic Machine Learning (AutoML),
Deep Learning Pipelines,
More complex Machine Learning pipelines.
Solutions that we’ve Found to Those Scikit-Learn's Problems
For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and useable within modern computing projects!
Inability to Reasonably do Automatic Machine Learning (AutoML)
Problem: Defining the Search Space (Hyperparameter Distributions)
Problem: Defining Hyperparameters in the Constructor is Limiting
Problem: Different Train and Test Behavior
Problem: You trained a Pipeline and You Want Feedback on its Learning.
Inability to Reasonably do Deep Learning Pipelines
Problem: Scikit-Learn Hardly Allows for Mini-Batch Gradient Descent (Incremental Fit)
Problem: Initializing the Pipeline and Deallocating Resources
Problem: It is Difficult to Use Other Deep Learning (DL) Libraries in Scikit-Learn
Problem: The Ability to Transform Output Labels
Not ready for Production nor for Complex Pipelines
Problem: Processing 3D, 4D, or ND Data in your Pipeline with Steps Made for Lower-Dimensional Data
Problem: Modify a Pipeline Along the Way, such as for Pre-Training or Fine-Tuning
Problem: Getting Model Attributes from Scikit-Learn Pipeline
Problem: You can't Parallelize nor Save Pipelines Using Steps that Can't be Serialized "as-is" by Joblib
Additional pipeline methods and features offered through Neuraxle
Note: if a step of a pipeline doesn’t need to have one of the fit or transform methods, it could inherit from NonFittableMixin or NonTransformableMixin to be provided a default implementation of one of those methods to do nothing.
As a starter, it is possible for pipelines or their steps to also optionally define those methods:
“setup” which will call the “setup” method on each of its step. For instance, if a step contains a TensorFlow, PyTorch, or Keras neural network, the steps could create their neural graphs and register them to the GPU in the “setup” method before fit. It is discouraged to create the graphs directly in the constructors of the steps for several reasons, such as if the steps are copied before running many times with different hyperparameters within an Automatic Machine Learning algorithm that searches for the best hyperparameters for you.
“teardown”, which is the opposite of the “setup” method: it clears resources.
The following methods are provided by default to allow for managing hyperparameters:
“get_hyperparams” will return you a dictionary of the hyperparameters. If your pipeline contains more pipelines (nested pipelines), then the hyperparameters' keys are chained with double-underscore “__” separators.
“set_hyperparams” will allow you to set new hyperparameters in the same format as when you get them.
“get_hyperparams_space” allows you to get the hyperparameter space, which will not be empty if you defined one. So, the only difference with “get_hyperparams” here is that you'll get statistical distributions as values instead of precise values. For instance, one hyperparameter for the number of layers could be a RandInt(1, 3), which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it.
“set_hyperparams_space” can be used to set a new space using the same hyperparameter distribution classes as in “get_hyperparams_space”.
For more info on our suggested solutions, read the entries in the big list with links above.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import pandas as pd

class TextTransformer(BaseEstimator, TransformerMixin):
    """
    Transform text features
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None, *parg, **kwarg):
        return self

    def transform(self, X):
        return X[self.key]

class NumberTransformer(BaseEstimator, TransformerMixin):
    """
    Transform numeric features
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

def fit_predict(model, X_train, X_test, y_train, y_test):
    vec_tdidf = TfidfVectorizer(ngram_range=(2, 2), analyzer='word', norm='l2')

    text = Pipeline([
        ('transformer', TextTransformer(key='clear_messages')),
        ('vectorizer', vec_tdidf)
    ])
    word_numeric = Pipeline([
        ('transformer', NumberTransformer(key='word_count')),
        ('scalar', StandardScaler())
    ])
    word_class = Pipeline([
        ('transformer', NumberTransformer(key='preds')),
        ('scalar', StandardScaler())
    ])

    # Combine all features
    features = FeatureUnion([('Text_Feature', text),
                             ('Num1_Feature', word_numeric),
                             ('Num2_Feature', word_class)
                             ])

    # Classifier
    clf = model

    # Combine the classifier and the features
    pipe = Pipeline([('features', features),
                     ('clf', clf)
                     ])

    # Train the model
    pipe_fit = pipe.fit(X_train, y_train)

    # Predict on the data
    preds = pipe_fit.predict(X_test)

    return preds, pipe_fit
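For example, the helper above might be called like this (LogisticRegression, the DataFrame df and its 'label' column are illustrative assumptions; df must also contain the 'clear_messages', 'word_count' and 'preds' columns used by the transformers):
from sklearn.linear_model import LogisticRegression

# df is a pandas DataFrame with 'clear_messages', 'word_count', 'preds' and a 'label' column
X_train, X_test, y_train, y_test = train_test_split(df, df['label'], test_size=0.2)
preds, fitted_pipe = fit_predict(LogisticRegression(), X_train, X_test, y_train, y_test)
print(metrics.accuracy_score(y_test, preds))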
