The machine learning literature strongly suggests normalizing the data for SVMs (see Preprocessing data in scikit-learn). And as answered before, the same StandardScaler should be applied to both training and test data.
What are the advantages of using StandardScaler over manually subtracting the mean and dividing by the standard deviation (other than the ability to use it in a pipeline)?
LinearSVC in scikit-learn uses one-vs-the-rest for multiple classes (as larsmans mentioned, SVC uses one-vs-one for multi-class). So what would happen if I have multiple classes trained with a pipeline whose first estimator is normalization? Would it also calculate the mean and standard deviation of each class, and use them during classification?
To be more specific, does the following classifier apply different means and standard deviations to each class before the SVM stage of the pipeline?
estimators = [('normalize', StandardScaler()), ('svm', SVC(class_weight='auto'))]
clf = Pipeline(estimators)
# Training
clf.fit(X_train, y)
# Classification
clf.predict(X_test)
The feature scaling performed by StandardScaler is done without reference to the target classes. It considers only the X feature matrix, calculating the mean and standard deviation of each feature across all samples, irrespective of each sample's target class.
Each component of the pipeline operates independently: only the data is passed between them. Let's expand the pipeline's clf.fit(X_train, y). It roughly does the following:
X_train_scaled = clf.named_steps['normalize'].fit_transform(X_train, y)
clf.named_steps['svm'].fit(X_train_scaled, y)
The first, scaling step actually ignores the y it is passed, but calculates the mean and standard deviation of each feature in X_train and stores them in its mean_ and std_ attributes (the fit part; recent scikit-learn versions expose the latter as scale_). It also standardizes X_train and returns the result (the transform part). The next step learns an SVM model and does what is necessary for one-vs-rest.
Now, from the pipeline's perspective at classification time, clf.predict(X_test) expands to:
X_test_scaled = clf.named_steps['normalize'].transform(X_test)
y_pred = clf.named_steps['svm'].predict(X_test_scaled)
returning y_pred. The first line uses the stored mean_ and std_ to transform X_test with the parameters learnt from the training data.
Yes, the scaling algorithm isn't very complicated. It just subtracts the mean and divides by the std (see the sketch after this list). But StandardScaler:
provides a name to the algorithm so you can pull it out of the library
saves you from rolling your own, ensures it works correctly, and doesn't require you to understand what it's doing on the inside
remembers the parameters from a fit or fit_transform for later transform operations (as above)
provides the same interface as other data transformations (and hence can be used in a pipeline)
operates over dense or sparse matrices
is able to reverse the transformation with its inverse_transform method
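As a quick illustration of the first few points, here is a minimal sketch (note that recent scikit-learn versions expose the learned standard deviation as scale_ rather than std_):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

scaler = StandardScaler().fit(X_train)  # remembers mean_ and scale_

# transform() is just "subtract the stored mean, divide by the stored std"
manual = (X_test - scaler.mean_) / scaler.scale_
assert np.allclose(manual, scaler.transform(X_test))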
Two questions:
I'm trying to run a model that predicts churn. A lot of my features have multicollinearity issues. To address this problem I'm trying to penalize the coefficients with Ridge.
More specifically I'm trying to run a logistic regression but apply Ridge penalties (not sure if that makes sense) to the model...
Questions:
Would selecting a ridge regression classifier suffice for this? Or do I need to select a logistic regression classifier and append it with some param for a ridge penalty (i.e. LogisticRegression(apply_penality=Ridge))?
I'm trying to determine feature importance, and through some research it seems like I need to use this:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
However, I'm confused about how to access this function if my model has been built around the sklearn.pipeline.make_pipeline function.
I'm just trying to figure out which independent variables have the most importance in predicting my label.
Code below for reference
#prep data
X_prep = df_dummy.drop(columns='CHURN_FLAG')
#set predictor and target variables
X = X_prep #all features except churn_flag
y = df_dummy["CHURN_FLAG"]
#create train /test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)
'''
StandardScaler() -> incoming data needs to be standardized before any other transformation is performed on it.
SelectKBest() -> comes from the feature_selection module of scikit-learn. It selects the best features based on a specified scoring function (in this case, f_regression).
    The number of features is specified by the parameter k. Even within the selected features, we want to vary the final set of features fed to the model and find what performs best. We can do that with GridSearchCV.
Ridge() -> the estimator that performs the actual regression -- used to reduce the effect of multicollinearity.
GridSearchCV -> besides searching over all permutations of the selected parameters, GridSearchCV performs cross-validation on the training data.
'''
#Setting up a pipeline
pipe = make_pipeline(StandardScaler(), SelectKBest(f_regression), Ridge())
#A quick way to get a list of parameters that a pipeline can accept
#pipe.get_params().keys()
#putting together a parameter grid to search over using grid search
params = {
    'selectkbest__k': [1, 2, 3, 4, 5, 6],
    'ridge__fit_intercept': [True, False],
    'ridge__alpha': [0.01, 0.1, 1, 10],
    'ridge__solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}
#setting up the grid search
gs = GridSearchCV(pipe, params, n_jobs=-1, cv=5)
#fitting gs to training data
gs.fit(Xtrain, ytrain)
#building a dataframe from cross-validation data
df_cv_scores = pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score')
#selecting specific columns to create a view
df_cv_scores[['params','split0_test_score', 'split1_test_score', 'split2_test_score',
'split3_test_score', 'split4_test_score', 'mean_test_score',
'std_test_score', 'rank_test_score']].head()
#checking the selected permutation of parameters
gs.best_params_
'''
Finally, we can predict target values for the test set by passing its feature matrix to gs.
The predicted values can be compared with actual target values to visualize and communicate performance of the model.
'''
#checking how well the model does on the holdout-set
gs.score(Xtest,ytest)
#plotting predicted churn/active vs actual churn/active
y_preds = gs.predict(Xtest)
plt.scatter(ytest,y_preds)
Would selecting a ridge regression classifier suffice for this? Or do I need to select a logistic regression classifier and append it with some param for a ridge penalty (i.e. LogisticRegression(apply_penality=Ridge))?
So the choice between ridge regression and logistic regression here comes down to whether you are trying to do regression or classification. If you want to predict the quantity of churn on some continuous basis, use ridge; if you want to predict whether someone churned (or is likely to churn), use logistic regression.
Sklearn's LogisticRegression uses L2 regularization by default, which is the same penalty used by ridge regression. So you should be fine using that if it's the regularization you want : )
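For instance, a minimal sketch (penalty='l2' is already the default; note that C is the inverse of the regularization strength, so smaller C means a stronger ridge-style penalty):
from sklearn.linear_model import LogisticRegression

# C is the *inverse* regularization strength: C=0.1 shrinks the
# coefficients more strongly than C=10.0
clf = LogisticRegression(penalty='l2', C=0.1)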
I'm trying to determine feature importance and through some research, it seems like I need to use this.
In general you can access the elements of a pipeline through the named_steps attribute; note that make_pipeline names each step after its lowercased class name. So in your case, if you wanted to access SelectKBest you could do:
pipe.named_steps["selectkbest"].get_feature_names_out()  # scikit-learn >= 1.0; use get_support() on older versions
That's going to get you the feature names; now you still need the values. Here you have to access your model's learned coefficients. For ridge and logistic regression it should be something like:
pipe.named_steps["logisticregression"].coef_
I have a blog post about this if you want a more detailed tutorial here
We can generate a multi-target regression dataset using the make_regression() function of scikit-learn. Here, the number of targets is 2:
X, y = make_regression(n_samples=5000, n_features=10, n_informative=7, n_targets=2, random_state=1, noise=5)
Now, I want to make a multi-target dataset where the ranges (or patterns) of the target variables will be different, so that different ML models can fit and predict well for different targets.
Say I have 2 targets in a dataset. Target 1 might be fit and predicted very well by Linear, Lasso, or Ridge, while target 2 will be fit and predicted well by RF, SVR, or KNN.
Any idea how can I make this type of dataset?
According to the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html), this function generates data from a linear regression model, which means data generated by it should be fit well by linear methods. Perhaps lasso would work better if the number of informative features is much smaller than the total number of features, because lasso tends to create sparse models.
To generate data that does not fit well with linear models but does with non-linear models such as RF, SVR, or KNN, you will need to add non-linearity to the data. As an example approach, transforming y by some non-linear function such as sin(y) may work (I did not try it, though).
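A sketch of that idea (untested, like the suggestion above; the sine warp is just one arbitrary choice of non-linearity):
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=5000, n_features=10, n_informative=7,
                       n_targets=2, random_state=1, noise=5)

# keep target 0 linear, warp target 1 through a sine so linear models
# should do well on target 0 and non-linear models better on target 1
y_nl = y.copy()
scale = y[:, 1].std()
y_nl[:, 1] = np.sin(y[:, 1] / scale) * scale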
The fit() method in sklearn appears to serve different purposes within the same interface.
When applied to the training set, like so:
model.fit(X_train, y_train)
fit() is used to learn parameters that will later be used on the test set with predict(X_test).
However, there are cases where there is no 'learning' involved with fit(), only some normalization to transform the data, like so:
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(X_train)
which will simply scale feature values to between, say, 0 and 1, so that features with higher variance don't have a disproportionate influence on the model.
To make things even less intuitive, sometimes the fit() that scales (and already appears to be transforming) needs to be followed by a further transform() call, before fit() is called again on the estimator that actually learns and builds the model, like so:
X_train2 = min_max_scaler.transform(X_train)
X_test2 = min_max_scaler.transform(X_test)
# the model being used
knn = KNeighborsClassifier(n_neighbors=3,metric="euclidean")
# learn parameters
knn.fit(X_train2, y_train)
# predict
y_pred = knn.predict(X_test2)
Could someone please clarify the use, or multiple uses, of fit(), as well as the difference of scaling and transforming the data?
The fit() function provides a common interface that is shared among all scikit-learn objects.
This function takes as argument X (and sometimes a y array) and computes the object's statistics. For example, calling fit on a MinMaxScaler transformer will compute its statistics (data_min_, data_max_, data_range_, ...).
Therefore we should see fit() as the method that computes the necessary statistics of an object.
This common interface is really helpful, as it allows you to combine transformers and estimators using a Pipeline and to fit and predict through all the steps in one go, as follows:
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
X, y = make_classification(n_samples=1000)
model = make_pipeline(MinMaxScaler(), NearestNeighbors())
model.fit(X, y)
This also offers the possibility of serializing the whole model into one single object.
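For example (a sketch; joblib is what the scikit-learn docs suggest for persisting fitted models):
import joblib

# the fitted pipeline (scaler + estimator) persists as one artifact
joblib.dump(model, "model.joblib")
restored = joblib.load("model.joblib")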
Without this composition module, I agree with you that it would not be very practical to work with independent transformers and estimators.
In scikit-learn there are three kinds of objects that share interfaces: estimators, transformers, and predictors.
Estimators have a fit() function, which always serves the same purpose: it estimates parameters based on the dataset.
Transformers have a transform() function, which returns the transformed dataset. Some estimators are also transformers, e.g. MinMaxScaler().
Predictors have a predict() function, which returns predictions for new instances, e.g. KNeighborsClassifier().
Both MinMaxScaler() and KNeighborsClassifier() have a fit() method, because they share the estimator interface.
However, there are cases when there is no 'learning' involved with fit()
There is 'learning' involved: the transformer MinMaxScaler() has to 'learn' the min and max values of each numerical feature.
When you call min_max_scaler.fit(X_train), your scaler estimates the values for each numerical column in your train set. min_max_scaler.transform(X_train) scales your train set based on those estimations, and min_max_scaler.transform(X_test) scales the test set with the estimations learned from the train set. It is important to scale both train and test sets with the same estimations.
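A small sketch of that point: the test set is scaled with the training set's min and max, so transformed test values can even fall outside [0, 1]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [10.0]])
X_test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler().fit(X_train)  # learns data_min_=0, data_max_=10
print(scaler.transform(X_test))       # [[0.5] [2. ]] -- 20 maps past 1.0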
For further reading, you can check this: https://arxiv.org/abs/1309.0238
I'm trying to get the top 10 most informative (best) features for an SVM classifier with an RBF kernel. As I'm a beginner in programming, I tried some code I found online. Unfortunately, none of it works; I always get the error: ValueError: coef_ is only available when using a linear kernel.
This is the last code I tested:
scaler = StandardScaler(with_mean=False)
enc = LabelEncoder()
y = enc.fit_transform(labels)
vec = DictVectorizer()
feat_sel = SelectKBest(mutual_info_classif, k=200)
# Pipeline for SVM classifier
clf = SVC()
pipe = Pipeline([('vectorizer', vec),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('svc', clf)])
y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
# Now fit the pipeline using your data
pipe.fit(instances, y)
def show_most_informative_features(vec, clf, n=10):
    feature_names = vec.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        return ('\t%.4f\t%-15s\t\t%.4f\t%-15s' % (coef_1, fn_1, coef_2, fn_2))
print(show_most_informative_features(vec, clf))
Does someone know a way to get the top 10 features from a classifier with an RBF kernel? Or another way to visualize the best features?
I am not sure if what you are asking is possible for an RBF kernel in a similar way to the example you show (which, as your error suggests, only works with a linear kernel).
However, you could always try feature ablation: remove each feature one by one and test how that affects performance. The 10 features that most affect performance are your "top 10 features".
Obviously, this is only possible if (1) you have relatively few features and/or (2) training and testing your model does not take a long time.
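A sketch of that idea (assuming X is a dense 2-D feature matrix and y the labels; each candidate feature is dropped in turn and the cross-validated score recomputed):
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

baseline = cross_val_score(SVC(kernel='rbf'), X, y, cv=5).mean()

drops = []
for i in range(X.shape[1]):
    X_ablated = np.delete(X, i, axis=1)  # remove feature i
    score = cross_val_score(SVC(kernel='rbf'), X_ablated, y, cv=5).mean()
    drops.append(baseline - score)       # large drop = important feature

top10 = np.argsort(drops)[::-1][:10]     # indices of the 10 most impactful features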
There are at least two options available for feature selection for an SVM classifier with an RBF kernel within the scikit-learn Python module.
For univariate feature selection, you can use SelectKBest.
For classification, one of the following scoring functions can be specified within SelectKBest:
chi-squared, for non-negative features
ANOVA F-value, assuming linear dependency
mutual information, for general dependency (requires more samples)
For sparse data sets, the SequentialFeatureSelector (SFS) can progressively add or remove features (forward or backward selection) based on the cross-validation score for the classifier.
While SFS does not require the classifier to expose a coef_ or feature_importances_ attribute, it may be slow, since it requires m*k fits (i.e., adding/removing m features with k-fold cross-validation); see the sketch below.
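A minimal sketch of SFS with an RBF SVM (requires scikit-learn >= 0.24; assumes X and y are already defined):
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

sfs = SequentialFeatureSelector(SVC(kernel='rbf'),
                                n_features_to_select=10,
                                direction='forward',  # or 'backward'
                                cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features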
A recursive feature elimination (RFE) approach has also been proposed by Liu et al. in Feature selection for support vector machines with RBF kernel.
I used to believe that scikit-learn's Logistic Regression classifier (as well as SVM) automatically standardizes my data before training. The reason I believed this is the regularization parameter C that is passed to the LogisticRegression constructor: applying regularization (as I understand it) doesn't make sense without feature scaling, since regularization only works properly when all the features are on comparable scales. Therefore, I assumed that when calling LogisticRegression.fit(X) on training data X, the fit method first performs feature scaling and then starts training. To test my assumption, I decided to manually scale the features of X as follows:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)
Then I've initialized a LogisticRegression object with a regularization parameter C:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C=10.0, random_state=0)
I've found out that training the model on X is not equivalent to training the model on X_std. That is to say, the model produced by
log_reg.fit(X_std, y)
is not similar to the model produced by
log_reg.fit(X, y)
Does that mean that scikit-learn doesn't standardize the features before training? Or maybe it does scale but by applying a different procedure? If scikit-learn doesn't perform feature scaling, how is it consistent with requiring the regularization parameter C? Should I manually standardize my data every time before fitting the model in order for regularization to make sense?
From the following note in: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
I'd assume that you need to preprocess the data yourself (e.g. with a scaler from sklearn.preprocessing).
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’}
Algorithm to use in the optimization problem.
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ is faster for large ones.
For multiclass problems, only ‘newton-cg’ and ‘lbfgs’ handle multinomial loss; ‘sag’ and ‘liblinear’ are limited to one-versus-rest schemes.
‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty.
Note that ‘sag’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
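A common pattern, then, is to do the standardization yourself inside a pipeline, so the training statistics are reused at prediction time (a minimal sketch, assuming the X and y from your code):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# the scaler's mean/std are learned from the training data during fit()
# and automatically reapplied by predict()
log_reg = make_pipeline(StandardScaler(), LogisticRegression(C=10.0, random_state=0))
log_reg.fit(X, y)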