Sklearn - fit, scale and transform - python

The fit() method in sklearn appears to serve different purposes behind the same interface.
When applied to the training set, like so:
model.fit(X_train, y_train)
fit() is used to learn parameters that will later be used on the test set with predict(X_test)
However, there are cases when there is no 'learning' involved with fit(), but only some normalization to transform the data, like so:
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(X_train)
which will simply scale feature values to, say, the range 0 to 1, to prevent features with higher variance from having a disproportionate influence on the model.
To make things even less intuitive, the fit() call that scales (and already appears to be transforming) sometimes needs to be followed by a further transform() call before fit() is called again to actually learn and build the model, like so:
X_train2 = min_max_scaler.transform(X_train)
X_test2 = min_max_scaler.transform(X_test)
# the model being used
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
# learn parameters
knn.fit(X_train2, y_train)
# predict
y_pred = knn.predict(X_test2)
Could someone please clarify the use, or multiple uses, of fit(), as well as the difference of scaling and transforming the data?

The fit() function provides a common interface that is shared among all scikit-learn objects.
It takes as arguments X (and sometimes a y array) and computes the object's statistics. For example, calling fit on a MinMaxScaler transformer will compute its statistics (data_min_, data_max_, data_range_, ...).
Therefore we should see fit() as the method that computes the statistics an object needs.
This common interface is really helpful, as it allows transformers and estimators to be combined using a Pipeline, which can fit and predict through all steps in one go as follows:
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
X, y = make_classification(n_samples=1000)
model = make_pipeline(MinMaxScaler(), NearestNeighbors())
model.fit(X, y)
This also offers the possibility to serialize the whole model into one single object.
Without this composition module, I agree with you that it is not very practical to work with independent transformers and estimators.

In scikit-learn there are three kinds of objects that share an interface: estimators, transformers, and predictors.
Estimators have a fit() method, which always serves the same purpose: it estimates parameters based on the dataset.
Transformers have a transform() method, which returns the transformed dataset. Some estimators are also transformers, e.g. MinMaxScaler().
Predictors have a predict() method, which returns predictions on new instances, e.g. KNeighborsClassifier().
Both MinMaxScaler() and KNeighborsClassifier() have a fit() method, because both implement the estimator interface.
However, there are cases when there is no 'learning' involved with fit()
There is 'learning' involved. A transformer such as MinMaxScaler() has to 'learn' the min and max values of each numerical feature.
When you call min_max_scaler.fit(X_train), your scaler estimates these values for each numerical column in your train set. min_max_scaler.transform(X_train) then scales your train set based on those estimates, and min_max_scaler.transform(X_test) scales the test set with the estimates learned from the train set. It is important to scale both the train and test set with the same estimates.
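To make this concrete, here is a minimal sketch (with made-up numbers) showing that fit() only records the statistics, and that the test set is scaled with the train set's statistics, so it can even fall outside [0, 1]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [5.0], [9.0]])
X_test = np.array([[3.0], [11.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)                        # only records statistics of the train set
print(scaler.data_min_, scaler.data_max_)  # [1.] [9.]

print(scaler.transform(X_train))           # [[0.  ] [0.5 ] [1.  ]]
print(scaler.transform(X_test))            # [[0.25] [1.25]] -- values outside [0, 1] are possible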
For further reading, you can check this: https://arxiv.org/abs/1309.0238

Related

cross_val_score behaves differently with different classifiers in sklearn

I'm having some difficulty with cross_val_score() in sklearn.
I have instantiated a KNeighborsClassifier with the following code:
clf = KNeighborsClassifier(n_neighbors=28)
I am then using cross validation to understand the accuracy of this classifier on my df of features (x) and target series (y) with the following:
cv_score_av = np.mean(cross_val_score(clf, x, y, cv=5))
Each time I run the script I was hoping to achieve a different result; however, there is no option to set random_state=None like there is with RandomForestClassifier(), for example. Is there a way to achieve a different result with each run, or am I going to have to manually shuffle my data randomly prior to running cross_val_score on my KNeighborsClassifier model?
There seems to be some misunderstanding here on your part; the random_state argument in the Random Forest refers to the algorithm itself, and not to the cross-validation part. Such an argument is necessary here, since RF indeed includes some randomness in model building (a lot of it, in fact, as already implied by the very name of the algorithm); but kNN, in contrast, is a deterministic algorithm, so in principle there is no need for it to use any random_state.
That said, your question is indeed valid; I have commented in the past on this annoying and inconvenient absence of a shuffling argument in cross_val_score. Digging into the documentation, we see that under the hood, the function uses either StratifiedKFold or KFold to build the folds:
cv : int, cross-validation generator or an iterable, optional
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
and both of these functions, as you can easily see from the linked documentation pages, use shuffle=False as default value.
Anyway, the solution is simple, consisting of a single additional line of code; you just need to replace cv=5 with a previously defined StratifiedKFold object with shuffle=True:
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
skf = StratifiedKFold(n_splits=5, shuffle=True)
cv_score_av = np.mean(cross_val_score(clf, x, y, cv=skf))

The necessity of feature scaling before fitting a classifier in scikit-learn

I used to believe that scikit-learn's Logistic Regression classifier (as well as SVM) automatically standardizes my data before training. The reason I used to believe it is because of the regularization parameter C that is passed to the LogisticRegression constructor: Applying regularization (as I understand it) doesn't make sense without feature scaling. For regularization to work properly, all the features should be on comparable scales. Therefore, I used to assume that when calling the LogisticRegression.fit(X) on training data X, the fit method first performs feature scaling and then starts training. In order to test my assumption I've decided to manually scale the features of X as follows:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)
Then I've initialized a LogisticRegression object with a regularization parameter C:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C=10.0, random_state=0)
I've found out that training the model on X is not equivalent to training the model on X_std. That is to say, the model produced by
log_reg.fit(X_std, y)
is not similar to the model produced by
log_reg.fit(X, y)
Does that mean that scikit-learn doesn't standardize the features before training? Or maybe it does scale but by applying a different procedure? If scikit-learn doesn't perform feature scaling, how is it consistent with requiring the regularization parameter C? Should I manually standardize my data every time before fitting the model in order for regularization to make sense?
From the following note in: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
I'd assume that you need to preprocess the data yourself (e.g. with a scaler from sklearn.preprocessing.)
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’}
Algorithm to use in the optimization problem.
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ is faster for large ones.
For multiclass problems, only ‘newton-cg’ and ‘lbfgs’ handle multinomial loss; ‘sag’ and ‘liblinear’ are limited to one-versus-rest schemes.
‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty.
Note that ‘sag’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
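Following that note, a minimal sketch (made-up data) of standardizing explicitly, here chained in a Pipeline so the same training statistics are reused at predict time:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(C=10.0, random_state=0))
model.fit(X, y)              # the scaler's mean_ and scale_ are learned from X here
print(model.predict(X[:5]))  # predict() scales new data with the stored statistics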

What is exactly sklearn.pipeline.Pipeline?

I can't figure out how the sklearn.pipeline.Pipeline works exactly.
There are a few explanations in the docs. For example, what do they mean by:
Pipeline of transforms with a final estimator.
To make my question clearer, what are steps? How do they work?
Edit
Thanks to the answers I can make my question clearer:
When I call pipeline and pass, as steps, two transformers and one estimator, e.g.:
pipln = Pipeline([("trsfm1", transformer_1),
                  ("trsfm2", transformer_2),
                  ("estmtr", estimator)])
What happens when I call this?
pipln.fit()
OR
pipln.fit_transform()
I can't figure out how an estimator can be a transformer and how a transformer can be fitted.
A transformer in scikit-learn is some class that has fit and transform methods, or a fit_transform method.
A predictor is some class that has fit and predict methods, or a fit_predict method.
A pipeline is just an abstract notion, not some existing ML algorithm. Often in ML tasks you need to perform a sequence of different transformations (find a set of features, generate new features, select only some good features) on the raw dataset before applying a final estimator.
Here is a good example of Pipeline usage.
Pipeline gives you a single interface for all three steps of transformation plus the resulting estimator. It encapsulates transformers and predictors inside, and now you can do something like:
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
clf.fit(tfidfX, ytrain)
predicted = clf.predict(tfidfX)
# Now evaluate all steps on the test set, reusing the fitted transformers
vX_test = vect.transform(Xtest)
tfidfX_test = tfidf.transform(vX_test)
predicted = clf.predict(tfidfX_test)
With just:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)
With pipelines you can easily perform a grid search over a set of parameters for each step of this meta-estimator, as described in the link above. All steps except the last one must be transformers; the last step can be a transformer or a predictor.
Answer to the edit:
When you call pipln.fit(), each transformer inside the pipeline is fitted on the outputs of the previous transformer (the first transformer is fitted on the raw dataset). The last step may be a transformer or a predictor; you can call fit_transform() on the pipeline only if the last step is a transformer (one that implements fit_transform, or transform and fit separately), and you can call fit_predict() or predict() on the pipeline only if the last step is a predictor. So you simply can't call fit_transform or transform on a pipeline whose last step is a predictor.
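A quick sketch (toy data) of that rule: fit_transform() is available when the last step is a transformer, while predict() is available when it is a predictor:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.randn(10, 4)
y = np.array([0, 1] * 5)

# Last step is a transformer, so fit_transform() works
transform_pipe = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=2))])
Xt = transform_pipe.fit_transform(X)

# Last step is a predictor, so fit() and predict() work, but fit_transform() does not
predict_pipe = Pipeline([('scale', StandardScaler()), ('clf', SGDClassifier())])
y_pred = predict_pipe.fit(X, y).predict(X)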
I think that M0rkHaV has the right idea. Scikit-learn's pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc.). Let's break down the two major components:
Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you'll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC (in recent scikit-learn versions this is done by wrapping the estimator in SelectFromModel)!
Estimators are classes that implement both fit() and predict(). You'll find that many of the classifiers and regression models implement both these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn't necessarily implement predict(), but definitely implements fit()). All this means is that you wouldn't be able to call predict().
As for your edit: let's go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()  # first we initialize
vec = ['cat', 'dog', 'dog', 'dog']  # the label list we want binarized
Now, when the binarizer is fitted on some data, it will have an attribute called classes_ that contains the unique classes the transformer 'knows' about. Without calling fit(), the binarizer has no idea what the data looks like, so calling transform() wouldn't make any sense. You can see this by printing the list of classes before fitting on the data:
print(lb.classes_)
I get the following error when trying this:
AttributeError: 'LabelBinarizer' object has no attribute 'classes_'
But when you fit the binarizer on the vec list:
lb.fit(vec)
and try again
print(lb.classes_)
I get the following:
['cat' 'dog']
And now, after calling transform on the vec list:
print(lb.transform(vec))
we get the following:
[[0]
[1]
[1]
[1]]
As for estimators being used as transformers, let us use the decision tree classifier as an example of a feature extractor. Decision trees are great for a lot of reasons, but for our purposes, what's important is that they can rank the features they found useful for predicting. When you call transform() on a decision tree (supported in older scikit-learn versions), it will take your input data and keep what it thinks are the most important features. So you can think of it as transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features the decision tree found.
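Note that recent scikit-learn versions removed transform() from the tree itself; the same estimator-as-feature-selector idea is now expressed with SelectFromModel. A sketch with synthetic data:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=10, n_informative=3, random_state=0)

# Keep the features whose importance, as ranked by the fitted tree, is above the median
selector = SelectFromModel(DecisionTreeClassifier(random_state=0), threshold='median')
X_reduced = selector.fit_transform(X, y)
print(X.shape, '->', X_reduced.shape)  # (100, 10) -> (100, k) with k <= 10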
ML algorithms typically process tabular data. You may want to do preprocessing and post-processing of this data before and after your ML algorithm. A pipeline is a way to chain those data processing steps.
What are ML pipelines and how do they work?
A pipeline is a series of steps in which data is transformed. It comes from the old "pipe and filter" design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Pipelining Machine Learning as:
Pipe and filters. The pipeline’s steps process data, and they manage their inner state which can be learned from the data.
Composites. Pipelines can be nested: for example a whole pipeline can be treated as a single pipeline step in another pipeline. A pipeline step is not necessarily a pipeline, but a pipeline is itself at least a pipeline step by definition.
Directed Acyclic Graphs (DAG). A pipeline step's output may be sent to many other steps, and the resulting outputs can be recombined, and so on. Side note: although pipelines are acyclic, they can process multiple items one by one, and if their state changes (e.g. when using the fit_transform method each time), they can be viewed as recurrently unfolding through time, keeping their state (think of an RNN). That's an interesting way to see pipelines for doing online learning when putting them in production and training them on more data.
Methods of a Scikit-Learn Pipeline
Pipelines (or steps in the pipeline) must have these two methods:
“fit” to learn on the data and acquire state (e.g.: a neural network's weights are such state)
“transform” (or “predict”) to actually process the data and generate a prediction
It's also possible to call this method to chain both:
“fit_transform” to fit and then transform the data in one pass, which allows for potential code optimizations when the two methods must be done directly one after the other.
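For a plain transformer, the one-pass call behaves like the two chained calls; a quick check with StandardScaler on toy data:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])
one_pass = StandardScaler().fit_transform(X)      # fit and transform in one call
two_calls = StandardScaler().fit(X).transform(X)  # the same, done explicitly
print(np.allclose(one_pass, two_calls))           # True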
Problems of the sklearn.pipeline.Pipeline class
Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?
Scikit-Learn had its first release in 2007, in the pre-deep-learning era. However, it's one of the best-known and most widely adopted machine learning libraries, and it's still growing. On top of that, it uses the Pipe and Filter design pattern as a software architectural style; it's what makes Scikit-Learn so fabulous, added to the fact that it provides algorithms ready for use. However, it has massive issues when it comes to doing the following, which we should be able to do in 2020 already:
Automatic Machine Learning (AutoML),
Deep Learning Pipelines,
More complex Machine Learning pipelines.
Solutions that we’ve Found to Those Scikit-Learn's Problems
For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and usable within modern computing projects!
Inability to Reasonably do Automatic Machine Learning (AutoML)
Problem: Defining the Search Space (Hyperparameter Distributions)
Problem: Defining Hyperparameters in the Constructor is Limiting
Problem: Different Train and Test Behavior
Problem: You trained a Pipeline and You Want Feedback on its Learning.
Inability to Reasonably do Deep Learning Pipelines
Problem: Scikit-Learn Hardly Allows for Mini-Batch Gradient Descent (Incremental Fit)
Problem: Initializing the Pipeline and Deallocating Resources
Problem: It is Difficult to Use Other Deep Learning (DL) Libraries in Scikit-Learn
Problem: The Ability to Transform Output Labels
Not ready for Production nor for Complex Pipelines
Problem: Processing 3D, 4D, or ND Data in your Pipeline with Steps Made for Lower-Dimensional Data
Problem: Modify a Pipeline Along the Way, such as for Pre-Training or Fine-Tuning
Problem: Getting Model Attributes from Scikit-Learn Pipeline
Problem: You can't Parallelize nor Save Pipelines Using Steps that Can't be Serialized "as-is" by Joblib
Additional pipeline methods and features offered through Neuraxle
Note: if a step of a pipeline doesn't need to have one of the fit or transform methods, it can inherit from NonFittableMixin or NonTransformableMixin, which provide a default do-nothing implementation of that method.
As a starter, it is possible for pipelines or their steps to also optionally define those methods:
“setup” which will call the “setup” method on each of its step. For instance, if a step contains a TensorFlow, PyTorch, or Keras neural network, the steps could create their neural graphs and register them to the GPU in the “setup” method before fit. It is discouraged to create the graphs directly in the constructors of the steps for several reasons, such as if the steps are copied before running many times with different hyperparameters within an Automatic Machine Learning algorithm that searches for the best hyperparameters for you.
“teardown”, which is the opposite of the “setup” method: it clears resources.
The following methods are provided by default to allow for managing hyperparameters:
“get_hyperparams” will return you a dictionary of the hyperparameters. If your pipeline contains more pipelines (nested pipelines), then the hyperparameters' keys are chained with double-underscore “__” separators.
“set_hyperparams” will allow you to set new hyperparameters in the same format as when you get them.
“get_hyperparams_space” allows you to get the space of hyperparameters, which will not be empty if you defined one. The only difference with “get_hyperparams” here is that you'll get statistical distributions as values instead of precise values. For instance, one hyperparameter for the number of layers could be RandInt(1, 3), meaning 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training with it.
“set_hyperparams_space” can be used to set a new space using the same hyperparameter distribution classes as in “get_hyperparams_space”.
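For comparison, scikit-learn itself exposes the analogous mechanism through get_params and set_params, with the same double-underscore chaining for nested steps; a small sketch:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()), ('clf', SGDClassifier())])
# Keys of nested steps are chained with '__', e.g. 'clf__alpha'
print('clf__alpha' in pipe.get_params())  # True
pipe.set_params(clf__alpha=0.001)         # set a hyperparameter of the nested step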
For more info on our suggested solutions, read the entries in the big list with links above.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import pandas as pd

class TextTransformer(BaseEstimator, TransformerMixin):
    """Select a text feature column from the DataFrame."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None, *parg, **kwarg):
        return self

    def transform(self, X):
        return X[self.key]

class NumberTransformer(BaseEstimator, TransformerMixin):
    """Select a numeric feature column from the DataFrame."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

def fit_predict(model, X_train, X_test, y_train, y_test):
    vec_tdidf = TfidfVectorizer(ngram_range=(2, 2), analyzer='word', norm='l2')
    text = Pipeline([
        ('transformer', TextTransformer(key='clear_messages')),
        ('vectorizer', vec_tdidf)
    ])
    word_numeric = Pipeline([
        ('transformer', NumberTransformer(key='word_count')),
        ('scalar', StandardScaler())
    ])
    word_class = Pipeline([
        ('transformer', NumberTransformer(key='preds')),
        ('scalar', StandardScaler())
    ])
    # Combine all features
    features = FeatureUnion([('Text_Feature', text),
                             ('Num1_Feature', word_numeric),
                             ('Num2_Feature', word_class)])
    # The classifier
    clf = model
    # Combine the features with the classifier
    pipe = Pipeline([('features', features),
                     ('clf', clf)])
    # Train the model
    pipe_fit = pipe.fit(X_train, y_train)
    # Predict on the test data
    preds = pipe_fit.predict(X_test)
    return preds, pipe_fit
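A hypothetical usage sketch; the DataFrame columns (clear_messages, word_count, preds) are the ones the transformers above expect, and the data itself is invented:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'clear_messages': ['free money now', 'meeting at noon', 'win a prize now', 'lunch tomorrow'],
    'word_count': [3, 3, 4, 2],
    'preds': [1, 0, 1, 0],
})
y = pd.Series([1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.5, stratify=y, random_state=0)
preds, fitted_pipe = fit_predict(LogisticRegression(), X_train, X_test, y_train, y_test)
print(preds)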

How does LassoCV in scikit-learn partition data?

I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and that which I have seen elsewhere, instead of simply conducting cross validation on all of the training data it is advised to split it up into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set, and then the hyperparameter alpha is tuned on the basis of cross-validation results on the validation set. Finally, the accepted model is used on the test set to give a realistic view of how it will perform in reality. Separating the concerns out here is a preventative measure against overfitting.
Actual Question
Does LassoCV conform to the above protocol, or does it just somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.
If you use sklearn.model_selection.cross_val_score with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.model_selection.KFold). The train set will be passed to the LassoCV, which itself performs another splitting of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LassoCV

X = np.random.randn(20, 10)
y = np.random.randn(len(X))

cv_outer = KFold(n_splits=5)
lasso = LassoCV(cv=3)  # cv=3 makes an inner KFold splitting with 3 folds
scores = cross_val_score(lasso, X, y, cv=cv_outer)
Answer: no, LassoCV will not do all the work for you; you have to use it in conjunction with cross_val_score to obtain what you want. At the same time, this is the reasonable way to implement such objects, since we may also be interested in just fitting a hyperparameter-optimized LassoCV without necessarily evaluating it directly on another set of held-out data.

scikit-learn preprocessing SVM with multiple classes in a pipeline

The literature on machine learning strongly suggests normalization of data for SVM (Preprocessing data in scikit-learn). And as answered before, the same StandardScaler should be applied to both training and test data.
What are the advantages of using StandardScaler over manually subtracting the mean and dividing by the standard deviation (other than the ability to use it in a pipeline)?
LinearSVC in scikit-learn relies on one-vs-the-rest for multiple classes (as larsmans mentioned, SVC relies on one-vs-one for multi-class). So what would happen if I have multiple classes trained with a pipeline with normalization as the first estimator? Would it also calculate the mean and standard deviation of each class, and use them during classification?
To be more specific, does the following classifier apply different mean and standard deviations to each class before svm stage of pipeline?
estimators = [('normalize', StandardScaler()), ('svm', SVC(class_weight='balanced'))]
clf = Pipeline(estimators)
# Training
clf.fit(X_train, y)
# Classification
clf.predict(X_test)
The feature scaling performed by StandardScaler is performed without reference to the target classes. It only considers the X feature matrix. It calculates the mean and standard deviation of each feature across all samples, irrespective of the target class of each sample.
Each component of the pipeline operates independently: only the data is passed between them. Let's expand the pipeline's clf.fit(X_train, y). It roughly does the following:
X_train_scaled = clf.named_steps['normalize'].fit_transform(X_train, y)
clf.named_steps['svm'].fit(X_train_scaled, y)
The first scaling step actually ignores the y it is passed, but computes the mean and standard deviation of each feature in X_train and stores them in its mean_ and scale_ attributes (the fit component). It then centers and scales X_train and returns the result (the transform component). The next step learns an SVM model, and does what is necessary for one-vs-rest.
Now the pipeline's perspective for classification. clf.predict(X_test) expands to:
X_test_scaled = clf.named_steps['normalize'].transform(X_test)
y_pred = clf.named_steps['svm'].predict(X_test_scaled)
returning y_pred. In the first line it uses the stored mean_ and scale_ to apply the transformation to X_test using the parameters learnt from the training data.
Yes, the scaling algorithm isn't very complicated. It just subtracts the mean and divides by the std. But StandardScaler:
provides a name to the algorithm so you can pull it out of the library
saves you from rolling your own, ensures it works correctly, and doesn't require you to understand what it's doing on the inside
remembers the parameters from a fit or fit_transform for later transform operations (as above)
provides the same interface as other data transformations (and hence can be used in a pipeline)
operates over dense or sparse matrices
is able to reverse the transformation with its inverse_transform method
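A quick round trip (toy data) showing the remembered parameters and the inverse transformation:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler()
X_std = scaler.fit_transform(X)                         # remembers mean_ and scale_
print(scaler.mean_, scaler.scale_)                      # [2.] [0.816...]
print(np.allclose(scaler.inverse_transform(X_std), X))  # True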
