Insert or delete a step in scikit-learn Pipeline

Insert or delete a step in scikit-learn Pipeline - python

Is it possible to delete or insert a step in a sklearn.pipeline.Pipeline object?
I am trying to do a grid search with or without one step in the Pipeline object. And wondering whether I can insert or delete a step in the pipeline. I saw in the Pipeline source code, there is a self.steps object holding all the steps. We can get the steps by named_steps(). Before modifying it, I want to make sure, I do not cause unexpected effects.
Here is a example code:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('svm', SVC())]
clf = Pipeline(estimators)
clf
Is it possible that we do something like steps = clf.named_steps(), then insert or delete in this list? Does this cause undesired effect on the clf object?

I see that everyone mentioned only the delete step. In case you want to also insert a step in the pipeline:
pipe.steps.append(['step name',transformer()])
pipe.steps works in the same way as lists do, so you can also insert an item into a specific location:
pipe.steps.insert(1,['estimator',transformer()]) #insert as second step

Based on rudimentary testing you can safely remove a step from a scikit-learn pipeline just like you would any list item, with a simple
clf_pipeline.steps.pop(n)
where n is the position of the individual estimator you are trying to remove.

Just chiming in because I feel like the other answers answered the question of adding steps to a pipeline really well, but didn't really cover how to delete a step from a pipeline.
Watch out with my approach though. Slicing lists in this instance is a bit weird.
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
estimators = [('reduce_dim', PCA()), ('poly', PolynomialFeatures()), ('svm', SVC())]
clf = Pipeline(estimators)
If you want to create a pipeline with just steps PCA/Polynomial you can just slice the list step by indexes and pass it to Pipeline
clf1 = Pipeline(clf.steps[0:2])
Want to just use steps 2/3?
Watch out these slices don't always make the most amount of sense
clf2 = Pipeline(clf.steps[1:3])
Want to just use steps 1/3?
I can't seem to do using this approach
clf3 = Pipeline(clf.steps[0] + clf.steps[2]) # errors

Yes, that's possible, but you must fulfill same requirements which Pipeline requires at initialization, i.e. you cannot insert predictor in any step except last, you should call fit after you update Pipeline.steps, because after such update all steps (maybe they were learned in previous fit calls) will be invalidated, also last step of Pipeline should always implement fit method, all previous steps should implement fit_transform.
So yes, it will work in current codebase, but i think it's not a good solution for your task, it makes your code more dependent on current implementation of Pipeline, i think it's more convenient to create new Pipeline with modified steps, because Pipeline will at least validate all your steps in initialization, also creating new Pipeline will not significantly differ in terms of speed from modifying steps of existing pipeline, but as i've just said - creation of new Pipeline after each modification of steps is safer in case when someone will significantly change implementation of Pipeline.

Related

Is it possible to fit() a scikit-learn model in a loop or with an iterator

Usually people use scikit-learn to train a model this way:
from sklearn.ensemble import GradientBoostingClassifier as gbc
clf = gbc()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
It works fine as long as users' memory is large enough to accommodate the entire dataset. The dilemma for me is exactly this--the dataset is too big for my memory. My current solution is to enlarge the virtual memory of my machine and I have already made the system extremely slow by having too much virtual memory--so I start to think whether or not is it possible to feed the fit() method with samples in batches like this (and the answre is no, please keep reading and stop reminding me that the answer is no):
clf = gbc()
for i in range(X_train.shape[0]):
clf.fit(X_train[i], y_train[i])
so that I can read the training set from hard drive only when needed. I read the sklearn's manual and it seems to me that it does not support this:
Calling fit() more than once will overwrite what was learned by any previous fit()
So, is this possible?

This do not work in scikit-learn as explained in the comment section as well as in the documentation. However you can use river ( which is a python package for online/streaming machine learning). This package should be well-suited for you problematic.
Below is an example of training a LinearRegression using river.
from river import datasets
from river import linear_model
from river import metrics
from river import preprocessing
dataset = datasets.TrumpApproval()
model = (
preprocessing.StandardScaler() |
linear_model.LinearRegression(intercept_lr=.1)
)
metric = metrics.MAE()
for x, y, in dataset:
y_pred = model.predict_one(x)
# Update the running metric with the prediction and ground truth value
metric.update(y, y_pred)
# Train the model with the new sample
model.learn_one(x, y)

It is not clear in your question is which steps in the machine learning are slow for you. As also noted in the manual for riverml and this post in sklearn there is an option to do a partial fit. You will be restricted in terms of the models you can use for this incremental learning.
So using your example lets say we use a stochastic gradient descent classifier:
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
X,y = make_classification(100000)
clf = SGDClassifier(loss='log')
all_classes = list(set(y))
for ix in np.split(np.arange(0,X.shape[0]),100):
clf.partial_fit(X[ix,:],y[ix],classes = all_classes)

After reading the section 6. Strategies to scale computationally: bigger data of the official manual mentioned by #StupidWolf in this post, I am aware that this question is more to this than meets the eye.
The real difficulty is about the design of a lot of models.
Take Random Forest as an example, one of the most important techniques used to improve its performance compared with the simpler Decision Tree is the application of bagging, which means that the algorithm has to pick some random samples from the entire dataset to construct several weak learners as the basis of the Random Forest. It means that feeding the model with one sample after another won't work with this design.
Although it is still possible for scikit-learn to define an interface for end-users to implement so that scikit-learn can pick a random sample by calling this interface and end-users will decide how their implementation of the interface is about to return the needed data by scanning the dataset on the hard drive, it becomes way more complicated than I initially thought and the performance gain may not be very significant given that the IO-heavy "full table scan" (in database's term) is frequently needed.

Can I train my classifier multiple times?

I am building a basic NLP program using nltk and sklearn. I have a large dataset in a database and I am wondering what the best way to train the classifier is.
Is it advisable to download the training data in chunks and pass each chunk to the classifier? Is that even possible, or would I be overwriting what was learned from the previous chunk?
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
while True:
training_set, proceed = download_chunk() # pseudo
trained = SklearnClassifier(MultinomialNB()).train(training_set)
if not proceed:
break
How is this normally done? I want to avoid keeping the database connection open for too long.

The way you're doing it right now will actually just overwrite the classifier for each chunk in your training data as you're creating a new SklearnClassifier object each time. What you need to do is instantiate the SklearnClassifier prior to getting into the training loop. However, looking at the code here, it appears that the NLTK SklearnClassifier uses the fit method of the underlying Sklearn model. This means that you can't actually update a model once it is trained. What you need to do is instantiate the Sklearn model directly and use the partial_fit method. Something like this should work:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB() # must instantiate classifier outside of the loop or it will just get overwritten
while True:
training_set, proceed = download_chunk() # pseudo
clf.partial_fit(training_set)
if not proceed:
break
At the end, you'll have a MultinomialNB() classifier that has been trained on each chunk of your data.
Typically, if the whole dataset will fit in memory, it is somewhat more performant to just download the whole thing and call fit once (in which case you could actually use the nltk SklearnClassifier). See the notes about the partial_fit method here. However, if you are unable to fit the entire set in memory, it is certainly common practice to train on chunks of the data. You can do this by making several calls to the database or by extracting all of the information from the database, placing it in a CSV on your hard drive, and reading chunks of it from there.
Note
If you're using a shared database with other users, the DBAs may prefer you to extract all of it at once as once as this would (probably) take up fewer DB resources than making several separate, smaller calls to the database would.

What is exactly sklearn.pipeline.Pipeline?

I can't figure out how the sklearn.pipeline.Pipeline works exactly.
There are a few explanation in the doc. For example what do they mean by:
Pipeline of transforms with a final estimator.
To make my question clearer, what are steps? How do they work?
Edit
Thanks to the answers I can make my question clearer:
When I call pipeline and pass, as steps, two transformers and one estimator, e.g:
pipln = Pipeline([("trsfm1",transformer_1),
("trsfm2",transformer_2),
("estmtr",estimator)])
What happens when I call this?
pipln.fit()
OR
pipln.fit_transform()
I can't figure out how an estimator can be a transformer and how a transformer can be fitted.

Transformer in scikit-learn - some class that have fit and transform method, or fit_transform method.
Predictor - some class that has fit and predict methods, or fit_predict method.
Pipeline is just an abstract notion, it's not some existing ml algorithm. Often in ML tasks you need to perform sequence of different transformations (find set of features, generate new features, select only some good features) of raw dataset before applying final estimator.
Here is a good example of Pipeline usage.
Pipeline gives you a single interface for all 3 steps of transformation and resulting estimator. It encapsulates transformers and predictors inside, and now you can do something like:
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
predicted = clf.fit_predict(tfidfX)
# Now evaluate all steps on test set
vX = vect.fit_transform(Xtest)
tfidfX = tfidf.fit_transform(vX)
predicted = clf.fit_predict(tfidfX)
With just:
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)
With pipelines you can easily perform a grid-search over set of parameters for each step of this meta-estimator. As described in the link above. All steps except last one must be transforms, last step can be transformer or predictor.
Answer to edit:
When you call pipln.fit() - each transformer inside pipeline will be fitted on outputs of previous transformer (First transformer is learned on raw dataset). Last estimator may be transformer or predictor, you can call fit_transform() on pipeline only if your last estimator is transformer (that implements fit_transform, or transform and fit methods separately), you can call fit_predict() or predict() on pipeline only if your last estimator is predictor. So you just can't call fit_transform or transform on pipeline, last step of which is predictor.

I think that M0rkHaV has the right idea. Scikit-learn's pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc). Let's break down the two major components:
Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you'll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC!
Estimators are classes that implement both fit() and predict(). You'll find that many of the classifiers and regression models implement both these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn't necessarily implement predict(), but definitely implements fit()). All this means is that you wouldn't be able to call predict().
As for your edit: let's go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.
bin = LabelBinarizer() #first we initialize
vec = ['cat', 'dog', 'dog', 'dog'] #we have our label list we want binarized
Now, when the binarizer is fitted on some data, it will have a structure called classes_ that contains the unique classes that the transformer 'knows' about. Without calling fit() the binarizer has no idea what the data looks like, so calling transform() wouldn't make any sense. This is true if you print out the list of classes before trying to fit the data.
print bin.classes_
I get the following error when trying this:
AttributeError: 'LabelBinarizer' object has no attribute 'classes_'
But when you fit the binarizer on the vec list:
bin.fit(vec)
and try again
print bin.classes_
I get the following:
['cat' 'dog']
print bin.transform(vec)
And now, after calling transform on the vec object, we get the following:
[[0]
[1]
[1]
[1]]
As for estimators being used as transformers, let us use the DecisionTree classifier as an example of a feature-extractor. Decision Trees are great for a lot of reasons, but for our purposes, what's important is that they have the ability to rank features that the tree found useful for predicting. When you call transform() on a Decision Tree, it will take your input data and find what it thinks are the most important features. So you can think of it transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the Decision Tree found.

ML algorithms typically process tabular data. You may want to do preprocessing and post-processing of this data before and after your ML algorithm. A pipeline is a way to chain those data processing steps.
What are ML pipelines and how do they work?
A pipeline is a series of steps in which data is transformed. It comes from the old "pipe and filter" design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Pipelining Machine Learning as:
Pipe and filters. The pipeline’s steps process data, and they manage their inner state which can be learned from the data.
Composites. Pipelines can be nested: for example a whole pipeline can be treated as a single pipeline step in another pipeline. A pipeline step is not necessarily a pipeline, but a pipeline is itself at least a pipeline step by definition.
Directed Acyclic Graphs (DAG). A pipeline step's output may be sent to many other steps, and then the resulting outputs can be recombined, and so on. Side note: despite pipelines are acyclic, they can process multiple items one by one, and if their state change (e.g.: using the fit_transform method each time), then they can be viewed as recurrently unfolding through time, keeping their states (think like an RNN). That’s an interesting way to see pipelines for doing online learning when putting them in production and training them on more data.
Methods of a Scikit-Learn Pipeline
Pipelines (or steps in the pipeline) must have those two methods:
“fit” to learn on the data and acquire state (e.g.: neural network’s neural weights are such state)
“transform" (or "predict") to actually process the data and generate a prediction.
It's also possible to call this method to chain both:
“fit_transform” to fit and then transform the data, but in one pass, which allows for potential code optimizations when the two methods must be done one after the other directly.
Problems of the sklearn.pipeline.Pipeline class
Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?
Scikit-Learn had its first release in 2007, which was a pre deep learning era. However, it’s one of the most known and adopted machine learning library, and is still growing. On top of all, it uses the Pipe and Filter design pattern as a software architectural style - it’s what makes Scikit-Learn so fabulous, added to the fact it provides algorithms ready for use. However, it has massive issues when it comes to do the following, which we should be able to do in 2020 already:
Automatic Machine Learning (AutoML),
Deep Learning Pipelines,
More complex Machine Learning pipelines.
Solutions that we’ve Found to Those Scikit-Learn's Problems
For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and useable within modern computing projects!
Inability to Reasonably do Automatic Machine Learning (AutoML)
Problem: Defining the Search Space (Hyperparameter Distributions)
Problem: Defining Hyperparameters in the Constructor is Limiting
Problem: Different Train and Test Behavior
Problem: You trained a Pipeline and You Want Feedback on its Learning.
Inability to Reasonably do Deep Learning Pipelines
Problem: Scikit-Learn Hardly Allows for Mini-Batch Gradient Descent (Incremental Fit)
Problem: Initializing the Pipeline and Deallocating Resources
Problem: It is Difficult to Use Other Deep Learning (DL) Libraries in Scikit-Learn
Problem: The Ability to Transform Output Labels
Not ready for Production nor for Complex Pipelines
Problem: Processing 3D, 4D, or ND Data in your Pipeline with Steps Made for Lower-Dimensional Data
Problem: Modify a Pipeline Along the Way, such as for Pre-Training or Fine-Tuning
Problem: Getting Model Attributes from Scikit-Learn Pipeline
Problem: You can't Parallelize nor Save Pipelines Using Steps that Can't be Serialized "as-is" by Joblib
Additional pipeline methods and features offered through Neuraxle
Note: if a step of a pipeline doesn’t need to have one of the fit or transform methods, it could inherit from NonFittableMixin or NonTransformableMixin to be provided a default implementation of one of those methods to do nothing.
As a starter, it is possible for pipelines or their steps to also optionally define those methods:
“setup” which will call the “setup” method on each of its step. For instance, if a step contains a TensorFlow, PyTorch, or Keras neural network, the steps could create their neural graphs and register them to the GPU in the “setup” method before fit. It is discouraged to create the graphs directly in the constructors of the steps for several reasons, such as if the steps are copied before running many times with different hyperparameters within an Automatic Machine Learning algorithm that searches for the best hyperparameters for you.
“teardown”, which is the opposite of the “setup” method: it clears resources.
The following methods are provided by default to allow for managing hyperparameters:
“get_hyperparams” will return you a dictionary of the hyperparameters. If your pipeline contains more pipelines (nested pipelines), then the hyperparameter’ keys are chained with double underscores “__” separators.
“set_hyperparams” will allow you to set new hyperparameters in the same format of when you get them.
“get_hyperparams_space” allows you to get the space of hyperparameter, which will be not empty if you defined one. So, the only difference with “get_hyperparams” here is that you’ll get statistic distributions as values instead of a precise value. For instance, one hyperparameter for the number of layers could be a RandInt(1, 3) which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it.
“set_hyperparams_space” can be used to set a new space using the same hyperparameter distribution classes as in “get_hyperparams_space”.
For more info on our suggested solutions, read the entries in the big list with links above.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import pandas as pd
class TextTransformer(BaseEstimator, TransformerMixin):
"""
Преобразование текстовых признаков
"""
def __init__(self, key):
self.key = key
def fit(self, X, y=None, *parg, **kwarg):
return self
def transform(self, X):
return X[self.key]
class NumberTransformer(BaseEstimator, TransformerMixin):
"""
Преобразование числовых признаков
"""
def __init__(self, key):
self.key = key
def fit(self, X, y=None):
return self
def transform(self, X):
return X[[self.key]]
def fit_predict(model, X_train, X_test, y_train, y_test):
vec_tdidf = TfidfVectorizer(ngram_range=(2,2), analyzer='word', norm='l2')
text = Pipeline([
('transformer', TextTransformer(key='clear_messages')),
('vectorizer', vec_tdidf)
])
word_numeric = Pipeline([
('transformer', NumberTransformer(key='word_count')),
('scalar', StandardScaler())
])
word_class = Pipeline([
('transformer', NumberTransformer(key='preds')),
('scalar', StandardScaler())
])
# Объединение всех признаков
features = FeatureUnion([('Text_Feature', text),
('Num1_Feature', word_numeric),
('Num2_Feature', word_class)
])
# Классификатор
clf = model
# Объединение классификатора и признаков
pipe = Pipeline([('features', features),
('clf',clf)
])
# Обучение модели
pipe_fit=pipe.fit(X_train, y_train)
# Предсказание данных
preds = pipe_fit.predict(X_test)
return preds, pipe_fit

Set the weights of decision functions through stdin in Sklearn

Is there a method that I can input the coefficients to the clf of SVC in my script, then apply clf.score() or clf.predict() function for further test?
Currently I am using joblib.dump(clf,'file.plk') to save all the information of a trained clf. But this involves the disk writing/reading. It will be helpful for me if I can just define a clf with two arrays representing the support vector (clf.support_vectors_), weights (clf.coef_/clf.dual_coef_), and bias (clf.intercept_) respectively.

This line calls the prediction function from libsvm. It looks like this (but please take a look at the whole function _dense_predict):
libsvm.predict(
X, self.support_, self.support_vectors_, self.n_support_,
self.dual_coef_, self._intercept_,
self.probA_, self.probB_, svm_type=svm_type, kernel=kernel,
degree=self.degree, coef0=self.coef0, gamma=self._gamma,
cache_size=self.cache_size)
You can use this line and give it all the relevant information directly and will obtain a raw prediction. In order to do this, you must import the libsvm from sklearn.svm import libsvm. If your initial fitted classifier is called svc, then you can obtain all the relevant information from it by replacing all the self keywords with svc and keeping the values. If svc._impl gives you "c_svc", then you set svm_type=0.
Note that at the beginning of the _dense_predict function you have X = self._compute_kernel(X). If your data is X, then you need to transform it by doing K = svc._compute_kernel(X), and call the libsvm.predict function with K as the first argument
Scoring is independent from all this. Take a look at sklearn.metrics, where you will find e.g. the accuracy_score, which is the default score in SVM.
This is of course a somewhat suboptimal way of doing things, but in this specific case, if is impossible (I didn't check very hard) to set coefficients, then going into the code and seeing what it does and extracting the relevant part is surely an option.

Check out this blog post on memory usage of sklearn models using succinct tries to see if it is applicable.
If the other location does not have access to the sklearn packages you would need to create your own score and predict functions. clf.score() and clf.predict() requires clf to be an sklearn object.

Is it possible to toggle a certain step in sklearn pipeline?

I wonder if we can set up an "optional" step in sklearn.pipeline. For example, for a classification problem, I may want to try an ExtraTreesClassifier with AND without a PCA transformation ahead of it. In practice, it might be a pipeline with an extra parameter specifying the toggle of the PCA step, so that I can optimize on it via GridSearch and etc. I don't see such an implementation in sklearn source, but is there any work-around?
Furthermore, since the possible parameter values of a following step in pipeline might depend on the parameters in a previous step (e.g., valid values of ExtraTreesClassifier.max_features depend on PCA.n_components), is it possible to specify such a conditional dependency in sklearn.pipeline and sklearn.grid_search?
Thank you!

From the docs:
Individual steps may also be replaced as parameters, and non-final
steps may be ignored by setting them to None:
from sklearn.linear_model import LogisticRegression
params = dict(reduce_dim=[None, PCA(5), PCA(10)],
clf=[SVC(), LogisticRegression()],
clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=params)

Pipeline steps cannot currently be made optional in a grid search but you could wrap the PCA class into your own OptionalPCA component with a boolean parameter to turn off PCA when requested as a quick workaround. You might want to have a look at hyperopt to setup more complex search spaces. I think it has good sklearn integration to support this kind of patterns by default but I cannot find the doc anymore. Maybe have a look at this talk.
For the dependent parameters problem, GridSearchCV supports trees of parameters to handle this case as demonstrated in the documentation.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.