I'm trying to get the top 10 most informative (best) features for an SVM classifier with an RBF kernel. As I'm a beginner in programming, I tried some code snippets that I found online. Unfortunately, none of them work: I always get the error ValueError: coef_ is only available when using a linear kernel.
This is the last code I tested:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn import model_selection

scaler = StandardScaler(with_mean=False)
enc = LabelEncoder()
y = enc.fit_transform(labels)
vec = DictVectorizer()
feat_sel = SelectKBest(mutual_info_classif, k=200)
# Pipeline for the SVM classifier
clf = SVC()
pipe = Pipeline([('vectorizer', vec),
                 ('scaler', scaler),
                 ('mutual_info', feat_sel),
                 ('svc', clf)])
y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
# Now fit the pipeline using your data
pipe.fit(instances, y)

def show_most_informative_features(vec, clf, n=10):
    feature_names = vec.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print('\t%.4f\t%-15s\t\t%.4f\t%-15s' % (coef_1, fn_1, coef_2, fn_2))

show_most_informative_features(vec, clf)
Does someone know a way to get the top 10 features from a classifier with an RBF kernel? Or another way to visualize the best features?
I am not sure if what you are asking is possible for an RBF kernel in a similar way to the example you show (which, as your error suggests, only works with a linear kernel).
However, you could always try feature ablation; remove each feature one by one and test how that affects performance. The 10 features that most affect performance are your "top 10 features".
Obviously, this is only possible if (1) you have relatively few features and/or (2) training and testing your model does not take a long time.
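A minimal sketch of that idea, assuming a dense 2-D feature matrix X and a label array y (both placeholder names), scoring each feature by how much cross-validated accuracy drops when it is removed:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

baseline = cross_val_score(SVC(kernel='rbf'), X, y, cv=5).mean()
drops = []
for i in range(X.shape[1]):
    X_ablated = np.delete(X, i, axis=1)  # remove feature i
    score = cross_val_score(SVC(kernel='rbf'), X_ablated, y, cv=5).mean()
    drops.append(baseline - score)  # bigger drop = more informative feature
top10 = np.argsort(drops)[::-1][:10]  # indices of the 10 most informative features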
There are at least two options for feature selection for an SVM classifier with an RBF kernel within the scikit-learn Python module.
If you are performing univariate feature selection, you can use SelectKBest.
For classification, one of the following scoring functions can be specified within SelectKBest (a minimal sketch follows the list):
chi-squared (chi2), for non-negative features
ANOVA F-value (f_classif), assuming linear dependency
mutual information (mutual_info_classif), for general dependency; requires more samples
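For example, a minimal sketch (X_train and y_train are placeholder names) that keeps the 10 best features by mutual information before an RBF SVC:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('select', SelectKBest(mutual_info_classif, k=10)),
                 ('svc', SVC(kernel='rbf'))])
pipe.fit(X_train, y_train)
mask = pipe.named_steps['select'].get_support()  # boolean mask of the kept features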
For sparse data sets, the SequentialFeatureSelector can progressively add or remove features (forward or backward selection) based on the cross-validation score of the classifier.
While SFS does not require the classifier to expose a coef_ or feature_importances_ attribute, it may be slow, as it requires m*k fits (i.e., adding/removing m features with k-fold cross-validation).
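A minimal sketch, again with placeholder X and y:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Forward selection: greedily add features until 10 are selected,
# scoring each candidate set by 5-fold cross-validation.
sfs = SequentialFeatureSelector(SVC(kernel='rbf'), n_features_to_select=10,
                                direction='forward', cv=5)
sfs.fit(X, y)
X_selected = sfs.transform(X)  # X reduced to the selected columns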
A recursive feature elimination (RFE) approach has also been proposed by Liu et al. in "Feature selection for support vector machines with RBF kernel".
Two questions:
I'm trying to run a model that predicts churn. A lot of my features have multicollinearity issues. To address this problem, I'm trying to penalize the coefficients with Ridge.
More specifically, I'm trying to run a logistic regression but apply Ridge penalties (not sure if that makes sense) to the model...
Questions:
Would selecting a ridge regression classifier suffice for this? Or do I need to select a logistic regression classifier and append it with some param for a ridge penalty (i.e. LogisticRegression(apply_penality=Ridge))?
I'm trying to determine feature importance and through some research, it seems like I need to use this:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
However, I'm confused about how to access this function if my model has been built around the sklearn.pipeline.make_pipeline function.
I'm just trying to figure out which independent variables have the most importance in predicting my label.
Code below for reference
#prep data
X_prep = df_dummy.drop(columns='CHURN_FLAG')
#set predictor and target variables
X = X_prep #all features except churn_flag
y = df_dummy["CHURN_FLAG"]
#create train /test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)
'''
StandardScaler() -> incoming data needs to be standardized before any other transformation is performed on it.
SelectKBest() -> This method comes from the feature_selection module of scikit-learn. It selects the best features based on a specified scoring function (in this case, f_regression).
The number of features is specified by the value of the parameter k. Even within the selected features, we want to vary the final set of features that are fed to the model and find what performs best. We can do that with the GridSearchCV method.
Ridge() -> This is the estimator that performs the actual regression; it is used to reduce the effect of multicollinearity.
GridSearchCV -> Besides searching over all the permutations of the selected parameters, GridSearchCV performs cross-validation on the training data.
'''
#Setting up a pipeline
pipe = make_pipeline(StandardScaler(), SelectKBest(f_regression), Ridge())
#A quick way to get a list of parameters that a pipeline can accept
#pipe.get_params().keys()
#putting together a parameter grid to search over using grid search
params={
'selectkbest__k':[1,2,3,4,5,6],
'ridge__fit_intercept':[True,False],
'ridge__alpha':[0.01,0.1,1,10],
'ridge__solver':[ 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag',
'saga']
}
#setting up the grid search
gs=GridSearchCV(pipe,params,n_jobs=-1,cv=5)
#fitting gs to training data
gs.fit(Xtrain, ytrain)
#building a dataframe from cross-validation data
df_cv_scores=pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score')
#selecting specific columns to create a view
df_cv_scores[['params','split0_test_score', 'split1_test_score', 'split2_test_score',
'split3_test_score', 'split4_test_score', 'mean_test_score',
'std_test_score', 'rank_test_score']].head()
#checking the selected permutation of parameters
gs.best_params_
'''
Finally, we can predict target values for the test set by passing its feature matrix to gs.
The predicted values can be compared with the actual target values to visualize and communicate the performance of the model.
'''
#checking how well the model does on the holdout-set
gs.score(Xtest,ytest)
#plotting predicted churn/active vs actual churn/active
y_preds=gs.predict(Xtest)
plt.scatter(ytest,y_preds)
Would selecting a ridge regression classifier suffice for this? Or do I need to select a logistic regression classifier and append it with some param for a ridge penalty (i.e. LogisticRegression(apply_penality=Ridge))?
So the choice between ridge regression and logistic regression here comes down to whether you are trying to do classification or regression. If you want to predict the quantity of churn on some continuous basis, use ridge; if you want to predict whether someone churned (or is likely to churn), use logistic regression.
Sklearn's LogisticRegression uses an L2 penalty by default, which is the same kind of regularization that ridge regression uses. So you should be fine using that if it's the regularization you want :)
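For example (a minimal sketch; the parameter values are just placeholders):
from sklearn.linear_model import LogisticRegression

# L2-penalized ("ridge-style") logistic regression; C is the inverse of
# the regularization strength, so smaller C means a stronger penalty.
clf = LogisticRegression(penalty='l2', C=1.0)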
I'm trying to determine feature importance and through some research, it seems like I need to use this.
In general, you can access the elements of a pipeline through the named_steps attribute. With make_pipeline, each step is keyed by the lowercased class name, so in your case, if you wanted to access SelectKBest you could do:
pipe.named_steps["selectkbest"].get_support()
That's going to get you a boolean mask of the selected features, which you can use to index your feature names. Now you still need the values; here you have to access your model's learned coefficients. For your Ridge step (and likewise for logistic regression) it should be something like:
pipe.named_steps["ridge"].coef_
I have a blog post about this if you want a more detailed tutorial here
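Putting the two together for the grid search above (a sketch, assuming X is the feature DataFrame from your code and gs was refit on the best parameters, which is the GridSearchCV default):
best = gs.best_estimator_  # the pipeline refit on the full training data
mask = best.named_steps['selectkbest'].get_support()
selected = X.columns[mask]  # names of the features SelectKBest kept
coefs = best.named_steps['ridge'].coef_
# largest absolute coefficients first
for name, coef in sorted(zip(selected, coefs), key=lambda t: abs(t[1]), reverse=True):
    print('%-20s %.4f' % (name, coef))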
I am new to NLP and it is a bit confusing to me.
I am trying to do a text classification with SVC on my dataset.
I have an imbalanced dataset of 6 classes.
The text is news for classes of health, sport, culture, economy, science and web.
I am using TF-IDF for vectorization.
The preprocessing steps: lower-casing all the texts and removing the stop words. Since my text is in German, I did not use lemmatization.
My first try:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)
X_train = train['text']
y_train = train['category']
X_test = test['text']
y_test = test['category']
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
text_clf_lsvc.fit(X_train, y_train)
predictions = text_clf_lsvc.predict(X_test)
My accuracy score was 93%.
Then I decided to reduce the dimensionality, so on my second try I added TruncatedSVD:
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()), ('svd', TruncatedSVD()),
                          ('clf', LinearSVC())])
text_clf_lsvc.fit(X_train, y_train)
predictions = text_clf_lsvc.predict(X_test)
My accuracy score dropped to 34%.
My questions:
1. How can I improve my model if I want to stick with TF-IDF and SVC for classification?
2. What else can I do if I want a good classification?
The best way to improve accuracy, given that you want to stick with this configuration, is through hyperparameter tuning or by introducing additional components, such as feature selection.
Hyperparameter tuning
Most machine learning algorithms and parts of a machine learning pipeline have several parameters you can change. For example, the TfidfVectorizer has different ngram ranges, different analysis levels, different tokenizers, and many more parameters to vary. Most of these will affect your performance. So, what you can do is systematically vary these parameters (and those of your SVC), while monitoring your accuracy on a development set (i.e., not the test data!). Instead of a fixed development set, cross-validation is typically used in these kinds of settings.
The best way to do this in sklearn is through a RandomizedSearchCV (see here for details). This class automatically cross-validates and searches through the possible options you pre-specify by randomly sampling from the option set for a fixed number of iterations. By applying this technique on your training data, you will automatically find models that perform better for your given training data and your options. Ideally, these models would also perform better on your test data. Fair warning: cross-validated search techniques can take a while to run.
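A minimal sketch of such a search over your first pipeline (the parameter grid below is just an illustrative assumption, not a recommendation):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
param_dist = {
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__sublinear_tf': [True, False],
    'clf__C': [0.01, 0.1, 1, 10],
}
# Randomly samples 20 parameter combinations, each scored by 5-fold CV.
search = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=5, n_jobs=-1,
                            random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)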
Feature Selection
In addition to grid search, another way to improve performance is through feature selection. Feature selection typically consists of a statistical test that determines which features explain variance in the target you are trying to predict. The feature selection methods in sklearn are detailed here.
By far the most important bit here is that the performance of anything you add to your model should be verified on an independent development set, or in cross-validation. Leave your test data alone.
Say I have some sklearn training data:
features, labels = assign_dataSets() #assignment operation
Here features is a 2-D array, whereas labels is a 1-D array consisting of the values [0, 1].
Separating the features by class:
f1x = [features[i][0] for i in range(0, len(features)) if labels[i]==0]
f2x = [features[i][0] for i in range(0, len(features)) if labels[i]==1]
f1y = [features[i][1] for i in range(0, len(features)) if labels[i]==0]
f2y = [features[i][1] for i in range(0, len(features)) if labels[i]==1]
Now I plot the said data:
import matplotlib.pyplot as plt
plt.scatter(f1x,f1y,color='b')
plt.scatter(f2x,f2y,color='y')
plt.show()
Now I want to run the fitting operation with a classifier for example SVC.
from sklearn.svm import SVC
clf = SVC()
clf.fit(features, labels)
Now my question is: as support vector machines are really slow, is there a way to monitor the decision boundary of the classifier in real time (I mean, as the fitting operation is occurring)? I know that I can plot the decision boundary after the fitting operation has occurred, but I want the plotting of the classifier to occur in real time, perhaps with threading and running predictions on an array of points declared by a linspace. Does the fit function even allow such operations, or do I need to go for some other library?
Just so you know, I am new to machine-learning.
scikit-learn has this feature, but from my understanding it's limited to a few classifiers (e.g. GradientBoostingClassifier, MLPClassifier). To turn on this feature, you need to set verbose=True. For example:
clf = GradientBoostingClassifier(verbose=True)
I tried it with SVC and it didn't work as expected (probably for the reason sascha mentioned in the comment section). Here is a different variation of your question on StackOverflow.
With regards to your second question, if you switch to TensorFlow (another machine learning library), you can use the TensorBoard feature to monitor a few metrics (e.g. error decay) in real time.
However, to the best of my knowledge, the SVM implementation is still experimental in v1.5. TensorFlow is really good when working with neural-network-based models.
If you decide to use a DNN for classification using Tensorflow then here is a discussion about implementation on StackOverflow: No easy way to add Tensorboard output to pre-defined estimator functions DnnClassifier?
Useful References:
Tensorflow SVM (only linear support for now - v1.5): https://www.tensorflow.org/api_docs/python/tf/contrib/learn/SVM
Tensorflow Kernel Methods: https://www.tensorflow.org/versions/master/tutorials/kernel_methods
Tensorflow Tensorboard: https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard
Tensorflow DNNClassifier Estimator: https://www.tensorflow.org/api_docs/python/tf/estimator/DNNClassifier
I used to believe that scikit-learn's Logistic Regression classifier (as well as SVM) automatically standardizes my data before training. The reason I used to believe it is because of the regularization parameter C that is passed to the LogisticRegression constructor: Applying regularization (as I understand it) doesn't make sense without feature scaling. For regularization to work properly, all the features should be on comparable scales. Therefore, I used to assume that when calling the LogisticRegression.fit(X) on training data X, the fit method first performs feature scaling and then starts training. In order to test my assumption I've decided to manually scale the features of X as follows:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)
Then I've initialized a LogisticRegression object with a regularization parameter C:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C=10.0, random_state=0)
I've found out that training the model on X is not equivalent to training the model on X_std. That is to say, the model produced by
log_reg.fit(X_std, y)
is not similar to the model produced by
log_reg.fit(X, y)
Does that mean that scikit-learn doesn't standardize the features before training? Or maybe it does scale but by applying a different procedure? If scikit-learn doesn't perform feature scaling, how is it consistent with requiring the regularization parameter C? Should I manually standardize my data every time before fitting the model in order for regularization to make sense?
From the following note in: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
I'd assume that you need to preprocess the data yourself (e.g. with a scaler from sklearn.preprocessing).
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’}
Algorithm to use in the optimization problem.
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ is faster for large ones.
For multiclass problems, only ‘newton-cg’ and ‘lbfgs’ handle multinomial loss; ‘sag’ and ‘liblinear’ are limited to one-versus-rest schemes.
‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty.
Note that ‘sag’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
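A minimal sketch of doing that explicitly (reusing the poster's X, y, and C=10.0), so the scaling is learned on the training data and reapplied automatically at prediction time:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler's mean/std are fit together with the model and reused by predict.
model = make_pipeline(StandardScaler(), LogisticRegression(C=10.0, random_state=0))
model.fit(X, y)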
The literature on machine learning strongly suggests normalization of data for SVM (Preprocessing data in scikit-learn). And as answered before, the same StandardScaler should be applied to both training and test data.
What are the advantages of using StandardScaler over manually subtracting the mean and dividing by the standard deviation (other than the ability to use it in a pipeline)?
LinearSVC in scikit-learn depends on one-vs-the-rest for multiple classes (as larsmans mentioned, SVC depends on one-vs-one for multi-class). So what would happen if I have multiple classes trained with a pipeline with normalization as the first estimator? Would it also calculate the mean and standard deviation of each class, and use them during classification?
To be more specific, does the following classifier apply different means and standard deviations to each class before the svm stage of the pipeline?
estimators = [('normalize', StandardScaler()), ('svm', SVC(class_weight = 'auto'))]
clf = Pipeline(estimators)
# Training
clf.fit(X_train, y)
# Classification
clf.predict(X_test)
The feature scaling performed by StandardScaler is performed without reference to the target classes. It only considers the X feature matrix. It calculates the mean and standard deviation of each feature across all samples, irrespective of the target class of each sample.
Each component of the pipeline operates independently: only the data is passed between them. Let's expand the pipeline's clf.fit(X_train, y). It roughly does the following:
X_train_scaled = clf.named_steps['normalize'].fit_transform(X_train, y)
clf.named_steps['svm'].fit(X_train_scaled, y)
The first scaling step actually ignores the y it is passed, but calculates the mean and standard deviation of each feature in X_train and stores them in its mean_ and scale_ attributes (the fit component). It also centers and scales X_train and returns the result (the transform component). The next step learns an SVM model, and does what is necessary for one-vs-rest.
Now the pipeline's perspective for classification. clf.predict(X_test) expands to:
X_test_scaled = clf.named_steps['normalize'].transform(X_test)
y_pred = clf.named_steps['svm'].predict(X_test_scaled)
returning y_pred. In the first line it uses the stored mean_ and scale_ to apply the transformation to X_test using the parameters learnt from the training data.
Yes, the scaling algorithm isn't very complicated. It just subtracts the mean and divides by the std. But StandardScaler:
provides a name to the algorithm so you can pull it out of the library
saves you from rolling your own, ensures it works correctly, and doesn't require you to understand what it's doing on the inside
remembers the parameters from a fit or fit_transform for later transform operations (as above)
provides the same interface as other data transformations (and hence can be used in a pipeline)
operates over dense or sparse matrices
is able to reverse the transformation with its inverse_transform method
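A minimal round-trip sketch of those last points, using a made-up toy matrix:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler()
X_std = scaler.fit_transform(X)           # learns mean_ and scale_, then transforms
X_back = scaler.inverse_transform(X_std)  # undoes the transformation
assert np.allclose(X, X_back)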