Selecting Best features for ML - python

Is there any way to extract the best features from the data? Right now I am using SelectKBest from sklearn.
With it, I have to specify the number K of best features that should be selected.
Is there any way in which I don't have to specify the number of features to extract, and instead extract all the useful features?
from sklearn.feature_selection import SelectKBest, chi2
test = SelectKBest(score_func=chi2, k=4)

You can use "all" instead of a number
test = SelectKBest(score_func=chi2, k="all")
From the docs:
k : int or “all”, optional, default=10
    Number of top features to select. The “all” option bypasses selection, for use in a parameter search.
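If you prefer not to fix k at all, one option (a minimal sketch on the iris data, not from the answer above) is to fit with k="all" and then threshold the returned scores or p-values yourself; SelectFpr / SelectFdr in sklearn.feature_selection do this kind of p-value thresholding for you.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k="all").fit(X, y)
print(selector.scores_)    # chi2 statistic per feature
print(selector.pvalues_)   # matching p-values

# keep every feature whose p-value clears a threshold of your choice
X_useful = X[:, selector.pvalues_ < 0.05]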

There are many ways to select features; you can find them listed on Wikipedia. I think the best feature selection method is a deep understanding of the features themselves, but usually we have a hard time understanding them.
Maybe you can use 5-fold cross-validation to build a feature importance ranking and then select the important features from it.
You can also use an embedded method to select them, like this:
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
# Embedded feature selection with GBDT as the base model
X_selected = SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)
It's worth noting that you cannot drop a feature just because it seems useless on its own, because it may be related to other features. That is why feature selection is usually a greedy search process, which is often time-consuming.
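As a rough illustration of the cross-validated ranking mentioned above (my own sketch on the iris data, not code from the original answer):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# Average the model's importances over 5 folds, then rank the features
importances = []
for train_idx, _ in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
    importances.append(model.feature_importances_)

ranking = np.argsort(np.mean(importances, axis=0))[::-1]
print(ranking)   # feature indices, most important first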

Related

Forward feature selection with custom criterion

I am trying to get the best features from my data for classification. For this I want to try feature selection using SVM, KNN, LDA and QDA.
The way to evaluate this data is also a leave-one-out approach rather than cross-validation by splitting the data into parts (basically I can't split one file/matrix, but have to leave one file out for testing while training on the other files).
I tried using sfs with SVM in MATLAB but keep getting only the first feature and nothing else (there are 254 features).
Is there any way to do this in Python or MATLAB?
If you're trying to code the feature selector from scratch, I think you'd better first get deeper into the theory of your algorithm of choice.
But if you're looking for a way to get results faster, scikit-learn provides a variety of tools for feature selection. Have a look at this page.
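In Python, a sketch of forward selection with a custom estimator and leave-one-out evaluation (it assumes scikit-learn >= 0.24 for SequentialFeatureSelector; the toy data and the number of features to keep are placeholders). If "leave one file out" really means leaving out whole groups of rows, you could precompute the splits with LeaveOneGroupOut and pass that list as cv instead.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))      # stand-in for the real 254-feature data
y = rng.integers(0, 2, size=40)

sfs = SequentialFeatureSelector(SVC(kernel="linear"),
                                n_features_to_select=5,
                                direction="forward",
                                scoring="accuracy",
                                cv=LeaveOneOut(),
                                n_jobs=-1)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features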

Pandas info for 100+ features

I have a dataset at my disposal which consists of around 500 columns that I need to explore, keeping only the relevant ones. The pandas info(verbose=True) method does not even display this number of columns properly. I also used the missingno library to visualise nulls; however, it uses a lot of RAM. What could be used instead of matplotlib here?
How do you approach datasets with a lot of features (more than 100)? Any useful workflow to eliminate useless features? How do you use info(), or is there an alternative?
I also used the display options below to view everything. The question here is how to set them locally rather than globally?
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
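For reference, pandas also has option_context, which scopes such settings to a single block instead of setting them globally (a minimal sketch; df stands for whatever frame you are inspecting):
import pandas as pd

# Display settings apply only inside this block
with pd.option_context('display.max_rows', 500,
                       'display.max_columns', 500,
                       'display.width', 1000):
    print(df.describe())   # df is assumed to be your DataFrame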
UPDATE:
Methods or solutions to explore the initial raw data are what I am interested in: for instance, a one-cell script that summarises numerical features as distributions, categorical features as counts, and possibly something else. I can write this myself, but maybe there is a library, or a function of yours, that already does so?
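For what it is worth, a rough sketch of the kind of one-cell summary described in the update (the toy frame below just stands in for the real ~500-column data); libraries such as pandas-profiling (now ydata-profiling) generate this sort of report automatically.
import numpy as np
import pandas as pd

# Toy stand-in for the real wide DataFrame
df = pd.DataFrame({"price": np.random.randn(100),
                   "rooms": np.random.randint(1, 6, 100),
                   "city": np.random.choice(["a", "b", None], 100)})

numeric = df.select_dtypes(include="number")
categorical = df.select_dtypes(exclude="number")

print(numeric.describe().T)                                     # numeric distributions
print(df.isna().mean().sort_values(ascending=False).head(20))   # worst null ratios
for col in categorical.columns:                                 # categorical counts
    print(df[col].value_counts(dropna=False).head(10))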
Regarding the issue of useless features, you could easily estimate some metrics associated with feature effectiveness and filter them out using some threshold. Check out the sklearn feature selection docs.
Of course before doing that you'll have to make sure features are numeric and their representation is fit for the tests of your choice. To do that I suggest you check out sklearn pipelines (optional) and preprocessing docs.
Before estimating feature usefulness, make sure you cover missing data handling, encoding categorical variables and feature scaling.
You can use XGBoost's feature_importances_ attribute. You first need to train a model on your data with XGBoost and then, using feature_importances_, keep only the important features (by setting a threshold of your choice).
Dimensionality reduction with PCA or some other algorithm can also come in handy.
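A sketch of the XGBoost route described above, wiring the importance threshold through sklearn's SelectFromModel (toy data; the "median" threshold is only an example):
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

# Toy data standing in for the real wide dataset
X, y = make_classification(n_samples=500, n_features=120, n_informative=10, random_state=0)

# Keep only the features whose importance is above the median importance
selector = SelectFromModel(XGBClassifier(n_estimators=100, random_state=0), threshold="median")
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
print(selector.get_support(indices=True))   # indices of the kept features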

How to get CORRECT feature importance plot in XGBOOST?

Using two different methods of getting XGBoost feature importance gives me two different most important features; which one should be believed?
Which method should be used when? I am confused.
Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import xgboost as xgb
df = sns.load_dataset('mpg')
df = df.drop(['name','origin'],axis=1)
X = df.iloc[:,1:]
y = df.iloc[:,0]
Numpy arrays
# fit the model
model_xgb_numpy = xgb.XGBRegressor(n_jobs=-1,objective='reg:squarederror')
model_xgb_numpy.fit(X.to_numpy(), y.to_numpy())
plt.bar(range(len(model_xgb_numpy.feature_importances_)), model_xgb_numpy.feature_importances_)
Pandas dataframe
# fit the model
model_xgb_pandas = xgb.XGBRegressor(n_jobs=-1,objective='reg:squarederror')
model_xgb_pandas.fit(X, y)
axsub = xgb.plot_importance(model_xgb_pandas)
Problem
The NumPy method shows that the 0th feature, cylinders, is the most important. The pandas method shows that model_year is the most important. Which one is the CORRECT most important feature?
References
How to get feature importance in xgboost?
Feature importance 'gain' in XGBoost
It is hard to define THE correct feature importance measure. Each has pros and cons. It is a wide topic with no golden rule as of now, and I personally would suggest reading this online book by Christoph Molnar: https://christophm.github.io/interpretable-ml-book/. The book has an excellent overview of the different measures and algorithms.
As a rule of thumb, if you cannot use an external package, I would choose gain, as it is more representative of what one is interested in (one is typically not interested in the raw number of splits on a particular feature, but rather in how much those splits helped); see this question for a good summary: https://datascience.stackexchange.com/q/12318/53060. If you can use other tools, SHAP exhibits very good behaviour and I would always choose it over the built-in xgb tree measures, unless computation time is strongly constrained.
As for the difference you point at directly in your question, the root of it is that xgb.plot_importance uses weight as its default importance type, while the XGBModel itself uses gain as its default. If you configure them to use the same importance type, you will get similar distributions (up to the additional normalisation in feature_importances_ and the sorting in plot_importance).
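For instance (a sketch reusing X, y and xgb from the Setup above), telling both views to report gain should make them agree up to that normalisation and sorting:
# Explicitly ask the sklearn wrapper for gain-based importances
model_gain = xgb.XGBRegressor(n_jobs=-1, objective='reg:squarederror',
                              importance_type='gain')
model_gain.fit(X, y)
print(dict(zip(X.columns, model_gain.feature_importances_)))

# ...and tell plot_importance to plot gain instead of its default 'weight'
xgb.plot_importance(model_gain, importance_type='gain')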
There are 3 ways to get feature importance from Xgboost:
use built-in feature importance (I prefer gain type),
use permutation-based feature importance
use SHAP values to compute feature importance
In my post I wrote code examples for all 3 methods. Personally, I'm using permutation-based feature importance. In my opinion, the built-in feature importance can show features as important after overfitting to the data (this is just an opinion based on my experience). SHAP explanations are fantastic, but sometimes computing them can be time-consuming (and you may need to downsample your data).
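A short sketch of the permutation-based option (my own example, reusing the fitted model_xgb_pandas and X, y from the question; it needs scikit-learn >= 0.22, and ideally you would run it on held-out data):
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the score drops
result = permutation_importance(model_xgb_pandas, X, y,
                                n_repeats=10, random_state=0, n_jobs=-1)
for name, mean, std in zip(X.columns, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.4f} +/- {std:.4f}")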
From the answer here, which gives a neat explanation:
feature_importances_ returns weights - what we usually think of as "importance".
plot_importance returns the number of occurrences in splits.
Note: I think that the selected answer above does not actually cover the point.

Scikit learn fit estimator with predefined number of classes

I need to use some of the estimators in scikit-learn, namely LogisticRegression and SVM, but I have a problem: I have an extremely unbalanced dataset and need to run k-fold cross-validation. The thing is that the fold I am fitting on can sometimes contain only one of the available target classes. I wanted to know whether there is any way to predefine the number of classes for these estimators, maybe something like passing them a one-hot encoded representation of the target, where the shape of the target matrix already defines the number of classes even if all the examples in a fold belong to one class.
Is there any way to do this with scikit-learn? Maybe with another library? I know those two algorithms use liblinear, so maybe there is some interface I can use in that case.
Anyway, thank you for your time.
EDIT: StratifiedKFold cross-validation does not help me, because sometimes I have fewer occurrences of a class than the number of folds. E.g. I can have a dataset with 50 instances and 3 classes, where 46 belong to one class, 2 to a second class and 2 to a third class. I could go for 3-fold cross-validation, but I generally need results over more folds than that, and even with 3 folds it can still happen that one class is the only one present in a fold.
The comment saying you need to gather more data may be right. However, if you believe you have enough data for your model to learn something useful, you can oversample your minority classes (or possibly undersample the majority classes, though this sounds like a problem for oversampling). Having only one class present in the training data makes it pretty much impossible for your model to learn anything useful about separating the classes.
Here are some links to oversampling and undersampling libraries in Python. The well-known imbalanced-learn library is great.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Your case sounds like a good candidate for SMOTE. You also mentioned that you wanted to change the ratio. imblearn.over_sampling.SMOTE has a parameter for this, called ratio in older releases and sampling_strategy in recent ones, to which you can pass a dictionary. You can also express it as proportions (see the documentation).
SMOTE uses the k-nearest-neighbours algorithm to synthesise data points "similar" to those of the under-represented classes. This is more powerful than plain random oversampling because it helps avoid the model simply memorising key points of specific duplicated examples; instead, SMOTE creates a "similar" point in the (usually multi-dimensional) feature space, so your model can learn to generalise better.
NOTE: It is vital that you do not apply SMOTE to the full data set. You MUST apply SMOTE to the training set only (i.e. after you split), and then validate on the untouched validation and test sets to see whether your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and you will get a model that does not even closely resemble what you want.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X_normalized and y are assumed to be your preprocessed features and labels;
# the class names and sample counts below are placeholders for your own.
# Split first, so SMOTE only ever sees the training portion (see the note above).
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y, stratify=y, random_state=0)

# sampling_strategy (formerly `ratio`) maps each class label to the number of
# samples you want after resampling.
sm = SMOTE(random_state=0,
           sampling_strategy={'class1': 100, 'class2': 100, 'class3': 80,
                              'class4': 60, 'class5': 90})
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)

print('Original training set shape:', Counter(y_train))
print('Resampled training set shape:', Counter(y_train_smote))

smote_xgbc = XGBClassifier(n_jobs=8).fit(X_train_smote, y_train_smote)

print('TRAIN')
print(accuracy_score(y_train, smote_xgbc.predict(X_train)))
print(f1_score(y_train, smote_xgbc.predict(X_train), average='macro'))
print('TEST')
print(accuracy_score(y_test, smote_xgbc.predict(X_test)))
print(f1_score(y_test, smote_xgbc.predict(X_test), average='macro'))

Is it possible to toggle a certain step in sklearn pipeline?

I wonder if we can set up an "optional" step in sklearn.pipeline. For example, for a classification problem, I may want to try an ExtraTreesClassifier with AND without a PCA transformation ahead of it. In practice, it might be a pipeline with an extra parameter specifying the toggle of the PCA step, so that I can optimise over it via GridSearchCV, etc. I don't see such an implementation in the sklearn source, but is there any workaround?
Furthermore, since the possible parameter values of a following step in pipeline might depend on the parameters in a previous step (e.g., valid values of ExtraTreesClassifier.max_features depend on PCA.n_components), is it possible to specify such a conditional dependency in sklearn.pipeline and sklearn.grid_search?
Thank you!
From the docs:
Individual steps may also be replaced as parameters, and non-final
steps may be ignored by setting them to None:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
params = dict(reduce_dim=[None, PCA(5), PCA(10)],
              clf=[SVC(), LogisticRegression()],
              clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=params)
(Recent scikit-learn versions also accept the string 'passthrough' in place of None for skipping a step.)
Pipeline steps cannot currently be made optional in a grid search, but as a quick workaround you could wrap the PCA class into your own OptionalPCA component with a boolean parameter that turns PCA off when requested. You might want to have a look at hyperopt to set up more complex search spaces. I think it has good sklearn integration to support this kind of pattern by default, but I cannot find the doc anymore. Maybe have a look at this talk.
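A minimal sketch of such an OptionalPCA wrapper (the class name and its use_pca flag are made up here for illustration; they are not part of scikit-learn):
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class OptionalPCA(BaseEstimator, TransformerMixin):
    """PCA step that a grid search can switch on or off via use_pca."""
    def __init__(self, use_pca=True, n_components=None):
        self.use_pca = use_pca
        self.n_components = n_components

    def fit(self, X, y=None):
        if self.use_pca:
            self.pca_ = PCA(n_components=self.n_components).fit(X)
        return self

    def transform(self, X):
        return self.pca_.transform(X) if self.use_pca else X
In a grid search you would then sweep reduce_dim__use_pca over [True, False] alongside reduce_dim__n_components.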
For the dependent-parameters problem, GridSearchCV supports trees of parameters (a list of parameter grids) to handle this case, as demonstrated in the documentation.
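For instance, passing GridSearchCV a list of parameter grids lets you tie clf__max_features to each reduce_dim__n_components branch (the values here are placeholders):
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("reduce_dim", PCA()), ("clf", ExtraTreesClassifier())])
param_grid = [
    {"reduce_dim__n_components": [5],
     "clf__max_features": [2, 5]},        # values that are valid with 5 components
    {"reduce_dim__n_components": [10],
     "clf__max_features": [2, 5, 10]},    # wider range once 10 components exist
]
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)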
