I am trying to predict a binary (categorical) target from many continuous features, and would like to narrow my feature space before heading into model fitting. I noticed that the SelectKBest class from sklearn's feature_selection module has the following example on the Iris dataset (which also predicts a categorical target from continuous features):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
X.shape
(150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
(150, 2)
The example uses the chi2 test to determine which features should be used in the model. However, it is my understanding that the chi2 test is strictly meant for situations where categorical features predict a categorical response. I did not think the chi2 test could be used for scenarios like this. Is my understanding wrong? Can the chi2 test be used to test whether a categorical variable is dependent on a continuous variable?
The SelectKBest function with the chi2 test only works with categorical data. In fact, the result of the test only has real meaning if the features contain only 1's and 0's.
If you inspect the implementation of chi2 a little, you will see that the code only applies a sum across each feature, which means the function expects just binary values. Also, the documented parameters of the chi2 function indicate the following:
def chi2(X, y):
...
X : {array-like, sparse matrix}, shape = (n_samples, n_features_in)
Sample vectors.
y : array-like, shape = (n_samples,)
Target vector (class labels).
This means that the function expects to receive the feature vectors with all of their samples. But later, when the expected values are calculated, you will see:
feature_count = X.sum(axis=0).reshape(1, -1)
class_prob = Y.mean(axis=0).reshape(1, -1)
expected = np.dot(class_prob.T, feature_count)
And these lines of code only make sense if the X and Y vectors contain only 1's and 0's.
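As a minimal sketch of my own (not part of the answer) of what those sums amount to when the features really are 0/1 indicators, mirroring, as far as I understand it, the one-hot step inside sklearn's chi2:

import numpy as np
from sklearn.preprocessing import LabelBinarizer

# toy data: 6 samples, 2 binary features, binary target
X = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 0],
              [0, 0],
              [1, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 0])

# one-hot encode the target (as far as I can tell, chi2 does this internally)
Y = LabelBinarizer().fit_transform(y)
if Y.shape[1] == 1:              # binary target: add the complement column
    Y = np.hstack([1 - Y, Y])

observed = Y.T @ X                                  # class-by-feature counts of 1's
feature_count = X.sum(axis=0).reshape(1, -1)
class_prob = Y.mean(axis=0).reshape(1, -1)
expected = class_prob.T @ feature_count

print(observed)   # a genuine contingency table only because X is 0/1
print(expected)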
I agree with @lalfab; however, it's not clear to me why sklearn provides examples of using chi2 on datasets whose features are all continuous, such as the iris example above or this digits example from the SelectKBest documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
>>> X_new.shape
(1797, 20)
My understanding of this is that when using chi2 for feature selection, the dependent variable has to be categorical, but the independent variables can be either categorical or continuous, as long as they are non-negative. What the algorithm tries to do is first build a contingency table in matrix form that reveals the multivariate frequency distribution of the variables, and then look for the dependence structure underlying the variables using this contingency table. Chi2 is one way to measure that dependency.
From the Wikipedia on contingency table (https://en.wikipedia.org/wiki/Contingency_table, 2020-07-04):
Standard contents of a contingency table
Multiple columns (historically, they were designed to use up all the white space of a printed page). Where each row refers to a specific sub-group in the population (in this case men or women), the columns are sometimes referred to as banner points or cuts (and the rows are sometimes referred to as stubs).
Significance tests. Typically, either column comparisons, which test for differences between columns and display these results using letters, or, cell comparisons, which use color or arrows to identify a cell in a table that stands out in some way.
Nets or netts which are sub-totals.
One or more of: percentages, row percentages, column percentages, indexes or averages.
Unweighted sample sizes (counts).
Based on this, pure binary features can easily be summed up as counts, which is how the chi2 test is usually conducted. But as long as the features are non-negative, one can always accumulate them in the contingency table in a "meaningful" way. In the sklearn implementation, each feature is summed up as feature_count = X.sum(axis=0) and then combined with the class proportions from class_prob = Y.mean(axis=0).
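To illustrate (my own sketch, not from the original answer) that this is exactly what the implementation computes even for continuous non-negative features, you can rebuild the observed and expected matrices by hand and compare against sklearn's chi2 on iris; whether those per-class sums of a continuous feature form a statistically meaningful contingency table is precisely what the other answers debate:

import numpy as np
from scipy.stats import chisquare
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

X, y = load_iris(return_X_y=True)        # continuous, non-negative features

Y = LabelBinarizer().fit_transform(y)    # (150, 3) class indicator matrix
observed = Y.T @ X                       # per-class sum of each feature
feature_count = X.sum(axis=0).reshape(1, -1)
class_prob = Y.mean(axis=0).reshape(1, -1)
expected = class_prob.T @ feature_count

manual_stat, _ = chisquare(observed, expected)   # column-wise chi2 statistic
sklearn_stat, _ = chi2(X, y)
print(np.allclose(manual_stat, sklearn_stat))    # should print True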
In my understanding, you cannot use chi-square (chi2) for continuous variables. The chi2 calculation requires building a contingency table, where you count occurrences of each category of the variables of interest. As the cells in that RC table correspond to particular categories, I cannot see how such a table could be built from continuous variables without significant preprocessing.
So the iris example which you quote is, in my view, an example of incorrect usage.
But there are more problems with the existing implementation of chi2 feature reduction in Scikit-learn. First, as @lalfab wrote, the implementation requires binary features, but the documentation is not clear about this. This has led to a common perception in the community that SelectKBest can be used for categorical features, while in fact it cannot. Second, the Scikit-learn implementation fails to check the chi2 condition (at least 80% of the cells of the RC table need to have an expected count >= 5), which leads to incorrect results if some categorical features have many possible values. All in all, in my view this method should be used neither for continuous nor for categorical features (except binary ones). I wrote more about this below:
Here is the Scikit-learn bug report #21455:
and here are the article and the alternative implementation:
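For contrast, here is a sketch of my own (not the alternative implementation linked above) of the textbook chi-square test of independence on a genuinely categorical feature, built from a proper contingency table with scipy; the data below are hypothetical:

import pandas as pd
from scipy.stats import chi2_contingency

# hypothetical categorical feature vs. categorical target
df = pd.DataFrame({
    "color":  ["red", "red", "blue", "blue", "green", "red", "green", "blue"],
    "target": [1, 1, 0, 0, 1, 1, 0, 0],
})

table = pd.crosstab(df["color"], df["target"])      # RC contingency table of counts
stat, p, dof, expected = chi2_contingency(table)
print(table)
print(stat, p, dof)
print(expected)   # check the expected-count >= 5 rule of thumb before trusting p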
I'm trying to understand the effects of applying the Normalizer, applying the MinMaxScaler, or applying both to my data. I've read the docs in sklearn and seen some examples of use. I understand that MinMaxScaler is important (it is important to scale the features), but what about the Normalizer?
The practical result of using the Normalizer on my data remains unclear to me.
MinMaxScaler is applied column-wise, Normalizer is applied row-wise. What does that imply? Should I use the Normalizer, just use the MinMaxScaler, or use both?
As you have said,
MinMaxScaler is applied column-wise, Normalizer is applied row-wise.
Do not confuse Normalizer with MinMaxScaler. The Normalizer class from sklearn normalizes samples individually to unit norm. It is not a column-based but a row-based normalization technique. In other words, one determines its range from the columns and the other from the rows.
So, remember that we scale features, not records, because we want features to have the same scale so that the trained model does not give different weights to different features based on their range. Scaling the records would give each record its own scale, which is not what we need.
So, if your features are represented by rows, you should use the Normalizer. But in most cases features are represented by columns, so you should use one of the scalers from sklearn, depending on the case (a small demo after the descriptions below makes the row/column difference concrete):
MinMaxScaler transforms features by scaling each feature to a given range. It scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
The transformation is given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min, max = feature_range.
This transformation is often used as an alternative to zero mean, unit variance scaling.
StandardScaler standardizes features by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as:
z = (x - u) / s, where u is the mean and s is the standard deviation of the training samples. Use this if the data distribution is approximately normal.
RobustScaler is robust to outliers. It removes the median and scales the data according to the IQR (interquartile range), which is the range between the 25th and 75th percentiles.
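Here is the promised demo, a small sketch of my own on a toy array, to make the column-wise vs row-wise distinction concrete:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0]])

# MinMaxScaler works per column: each feature is mapped to [0, 1]
print(MinMaxScaler().fit_transform(X))

# Normalizer works per row: each sample is rescaled to unit L2 norm,
# so the ratios between features within one row are preserved
print(Normalizer(norm="l2").fit_transform(X))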
I am building a classification model using sklearn's GradientBoostingClassifier. For the same model, I tried different preprocessing techniques, StandardScaler, scale, and Normalizer, on the same data, but I get a different f1_score each time: highest for StandardScaler and lowest for Normalizer. Why is that? Is there any other technique with which I can get an even higher score?
The difference lies in their respective definitions:
StandardScaler: Standardize features by removing the mean and scaling to unit variance
Normalizer: Normalize samples individually to unit norm.
scale: Standardize a dataset along any axis. Center to the mean and component-wise scale to unit variance.
The data used to fit your model will change, and so will the F1 score.
Here is a useful link comparing the different scalers: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
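As a rough sketch of how you could compare the preprocessors yourself (the dataset below is my own stand-in, not from the question), keeping the preprocessing inside a pipeline so it is fit only on the training folds:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, Normalizer, scale

X, y = load_breast_cancer(return_X_y=True)   # stand-in binary classification data

for prep in [StandardScaler(), Normalizer()]:
    pipe = make_pipeline(prep, GradientBoostingClassifier(random_state=0))
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(type(prep).__name__, scores.mean())

# scale() is a function, not a transformer, so it is applied up front here
# (note this leaks test-fold statistics; a StandardScaler in the pipeline avoids that)
print("scale", cross_val_score(GradientBoostingClassifier(random_state=0),
                               scale(X), y, cv=5, scoring="f1").mean())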
I have been learning about classification techniques and have studied random forest, gradient boosting, etc. Based on some code available online, I tried to write Python 3 code for random forest and GBM. My objective is to get the probability values from the model, not just look at accuracy, as I intend to use the probability values to build a KS table later on.
I used the readily available Titanic data set to start practicing.
Following are some of the steps I did:
import pandas as pd

# load train data
train_df = pd.read_csv('***/classification/titanic/train.csv')
# load test data
test_df = pd.read_csv('***/Desktop/classification/titanic/test.csv')
# drop some variables in train data
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
# drop some variables in test data
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
# I calculated the Title variable (again based on multiple threads on Kaggle)
train_df = pd.get_dummies(train_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
# I checked for missing and IV values next (not including that code here)
predictors = [x for x in train_df.columns if x not in ['Survived', 'PassengerId']]
predictors
# create classifier object (GBM)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=10)
# fit the classifier with X and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()

# create classifier object (RF)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=10)
# fit the classifier with X and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
Now when I check the probability values from the two models, I notice that for the random forest output a significant chunk of the population has a probability score of exactly 0, whereas that is not the case for the GBM model.
I understand that the techniques are different, but how can the results be so far off? Am I missing something?
With a large chunk of the population tagged with 0 as its probability score, my KS table goes for a toss.
Welcome to SO! Since you don't seem to be having an issue with code execution specifically, or with totally incorrect outputs, this looks more appropriate for Cross Validated, where you can find answers to questions about statistical concerns.
In fact, I'd suggest that answers to this question might give you some good insight into why you are seeing very different values from the predict_proba method. In short: while GradientBoostingClassifier and RandomForestClassifier both use tree methods, what they do is very different, so a direct comparison of the model parameters is not necessarily appropriate.
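As a quick illustration (on a synthetic stand-in for the Titanic data, not the asker's actual setup), you can see the different behaviour of predict_proba directly:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# synthetic stand-in dataset, just to illustrate the behaviour
X, y = make_classification(n_samples=1000, n_features=10, random_state=10)

rf = RandomForestClassifier(random_state=10).fit(X, y)
gbm = GradientBoostingClassifier(random_state=10).fit(X, y)

rf_prob = rf.predict_proba(X)[:, 1]
gbm_prob = gbm.predict_proba(X)[:, 1]

# RF probabilities are averages of per-tree estimates, so they can hit exactly
# 0.0 or 1.0 on training data; GBM pushes an additive score through a sigmoid,
# so its probabilities rarely reach the extremes exactly
print("RF  exact zeros:", np.sum(rf_prob == 0.0), " min/max:", rf_prob.min(), rf_prob.max())
print("GBM exact zeros:", np.sum(gbm_prob == 0.0), " min/max:", gbm_prob.min(), gbm_prob.max())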
I want to compute the feature importances of a given dataset using ExtraTreesClassifier. My goal is to find high-scoring features for further classification processes. The X dataset has size (10000, 50), where the 50 columns are the features; this dataset contains data collected from only one user (i.e., from a single class), and Y holds the labels (all zeros).
However, the output returns all feature importances as zeros!
Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, Y)
X = pd.DataFrame(X)
print(model.feature_importances_)  # use the built-in feature_importances_ of tree-based classifiers
# plot a graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()
Output: an array of all zeros.
Can anyone tell me why all features have zero importance scores?
If the class labels all have the same value then the feature importances will all be 0.
I am not familiar enough with the algorithms to give a technical explanation as to why the importances are returned as 0 rather than nan or similar, but from a theoretical perspective:
You are using an ExtraTreesClassifier which is an ensemble of decision trees. Each of these decision trees will attempt to differentiate between samples of different classes in the target by minimizing impurity in some way (gini or entropy in the case of sklearn extra trees). When the target only contains samples of a single class, the impurity is already at the minimum, so no splits are required by the decision tree to reduce impurity any further. As such, no features are required to reduce impurity, so each feature will have an importance of 0.
Consider this another way. Each feature has exactly the same relationship with the target as any other feature: the target is 0 no matter what the value of the feature is for a specific sample (the target is completely independent of the feature). As such, each feature provides no new information about the target and so has no value for making a prediction.
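A tiny sketch of my own that reproduces this behaviour on random data:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# single-class target: no splits are needed, so every importance is 0
y_single = np.zeros(100)
print(ExtraTreesClassifier(random_state=0).fit(X, y_single).feature_importances_)

# two-class target: importances become non-zero and sum to 1
y_two = (X[:, 0] > 0.5).astype(int)
print(ExtraTreesClassifier(random_state=0).fit(X, y_two).feature_importances_)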
You can try doing:
print(sorted(zip(model.feature_importances_, X.columns), reverse=True))
Using the MinMaxScaler from sklearn, I scale my data as below.
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(features_train)
X_test_scaled = min_max_scaler.transform(features_test)
However, when printing X_test_scaled.min(), I get some negative values (the values do not fall between 0 and 1). This is because the lowest value in my test data was lower than that in the train data, on which the min-max scaler was fit.
How much does not having the data exactly normalized between 0 and 1 affect the SVM classifier? Also, is it bad practice to concatenate the train and test data into a single matrix, perform min-max scaling to ensure all values are between 0 and 1, and then separate them again?
If you could scale all your data in one shot, that would be better, because the scaler would then handle everything consistently (all values between 0 and 1). But for the SVM algorithm it should make no difference, since the scaler only shifts and stretches the scale; the relative differences between samples stay the same even when some scaled values are negative.
The documentation shows that negative values can occur, so I don't think this has an impact on the result.
For this scaling it probably doesn't matter much in practice, but in general you should not use your test data to estimate any parameters of the preprocessing. This can severely bias your results for more complex preprocessing steps.
There is really no reason why you would want to concatenate the data here, the SVM will deal with it.
If you would be using a model that needs positive values and your test data is not made positive, you might consider another strategy than the MinMaxScaler.
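A small sketch of my own illustrating both points: fit the scaler on the training data only, accept that test values can fall outside [0, 1], and clip only if a downstream model requires it:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [5.0], [10.0]])
X_test = np.array([[0.0], [12.0]])          # values outside the training range

scaler = MinMaxScaler().fit(X_train)        # fit on the training data only
print(scaler.transform(X_test))             # roughly [[-0.11], [1.22]]

# if a downstream model really needs [0, 1], clip after transforming
# (recent sklearn versions also offer MinMaxScaler(clip=True))
print(np.clip(scaler.transform(X_test), 0, 1))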