Feature selection on binary dataset(categorical)

Feature selection on binary dataset(categorical) - python

My dataset has 32 categorical variable, and one numerical continous variable(sales_volume)
First I transformed categorical variables to binary with one-hot encoding (pd.get_dummies) and now I have 1294 columns since every column has several categorical variable.
Now I want to reduce them before using any dimensional reduction techniques.
What is the best option to select the most effective variables?
For example; one categorical variable has two answers 'yes' and 'no'. Is it possible to 'yes' column has significant importance and 'no' column has nothing to explain? Would you drop the question('yes' and 'no' columns) or just 'no' column?
Thanks in advance.

On sklearn you could use sklearn.feature_selection.SelectFromModel which enables you to fit a model to all your features and pick only the features that have more importance in that model, for example a RandomForest. The get_support() method then gets you the important features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
clf = RandomForestClassifier()
sfm = SelectFromModel(clf)
sfm.fit(X,y)
sfm.get_support()

Related

All independent variables are categorical and dependent(target) variable is continuous

This is the data I need to build model upon:
dataframe
dataframe contains 834 rows and 4 columns('Size','Sector','Road Connectivity','Price')
AIM is to train a model so as to predict the price
'Size','Sector' and 'Road connectivity' are 3 features which are assigned to X variable.
'Price' i.e our target feature is assigned to y variable
i have made a pipeline which consists of one hot encoder and linear regressor
below is the code:
ohc=OneHotEncoder(categories = "auto")
lr=LinearRegression(fit_intercept=True,normalize=True)
pipe=make_pipeline(ohc,lr)
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
kfolds=ShuffleSplit(n_splits=5,test_size=0.2,random_state=0)
cross_val_score(pipe,X,y,cv=kfolds).mean()
output =0.8970496076598085
xinp=([['04M','Sec 10','C road']])
pipe.fit(X,y)
pipe.predict(xinp)
now when I pass the values to pipeline to predict it shows an error:
"""Found unknown categories ['Sec 10'] in column 1 during transform"""
ANY SUGGESTIONS which help build the model are appreciated...

It looks like you provided category (in xinp, the Sec 10 value), that was not present in training data, thus it can not be one hot encoded, because there is no dummy variable (no corresponding binary column) for it. One of possible solutions can be following:
ohc=OneHotEncoder(categories = "auto", handle_unknown = "ignore")
From scikit one hot encoder documentation:
handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this
parameter is set to ‘ignore’ and an unknown category is encountered
during transform, the resulting one-hot encoded columns for this
feature will be all zeros. In the inverse transform, an unknown
category will be denoted as None.

OneHotEncoding after LabelEncoding

In Sklearn how can I do OneHotEncoding after LabelEncoding in Sklearn.
What i have done so far is that i mapped all the string features of my dataset like such.
# Categorical boolean mask
categorical_feature_mask = X.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = X.columns[categorical_feature_mask].tolist()
After that i applied this to the dataset columns, with indexing like such:
X[categorical_cols] = X[categorical_cols].apply(lambda col: le.fit_transform(col))
My results were not super good, so what I want to do, is that I want to use ÒneHotEncoding to see if performance is improved.
This is my code:
ohe = OneHotEncoder(categorical_features = categorical_cols)
X[categorical_cols] = ohe.fit_transform(df).toarray()
I have tried different approaches, but what i try to accomplish here is using the OneHotEncoding technique to overwrite the features.

OneHotEncoder directly supports categorical features, so no need to use a LabelEncoder prior to using it. Also note, that you should not use a LabelEncoder to encode features. Check LabelEncoder for features? for a detailed explanation on this. A LabelEncoder only makes sense on the actual target here.
So select the categorical columns (df.select_dtypes is normally used here), and fit on the specified columns. Here's a sketch one how you could proceed:
# OneHot encoding categorical columns
oh_cols = df.select_dtypes('object').columns
X_cat = df[oh_cols].to_numpy()
oh = OneHotEncoder()
one_hot_cols = oh.fit(X_cat)
Then just call the transform method of the encoder. If you wanted to reconstruct the dataframe (as your code suggests) get_feature_names will give you the category names of the categorical features:
df_prepr = pd.DataFrame(one_hot_cols.transform(X_cat).toarray(),
columns=one_hot_cols.get_feature_names(input_features=oh_cols))

Using encoded target value

I have a pandas dataframe and one column of it is my target value which is categorical.
I used get_dummies for encoding my target value. Now, I have my encoded target value in 5 encoded column because my target value has 5 categories.
My question is that how can I consider all of these 5 columns in linear regression method?
I have x_dummies as my dependent values dataframe and y_dummies as my target value data frame with 5 columns of encoded values.
I have never had a target value in more than one column!
Is this correct?
Link to Assingment:
https://www.cs.waikato.ac.nz/~eibe/pubs/ordinal_tech_report.pdf
regr = linear_model.LinearRegression()
regr.fit( x_dummies_training, y_dummies_training)

If your target is categorical you may want to use a classifier , not a regressor.
You may read this article to understand the difference if you want .
So in your case you would want to use a classifier and keep your y target as one variable instead of one hot encoding it.
If you want a mathematical model that's easy to interpret ( i guessed that from your use of Linear Regression) you may want a multinomial logistic regression:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, solver='lbfgs',
multi_class='multinomial').fit(X, y)
You may want to check the sklearn documentation .
You could also try the wildly popular boosting trees methods that should give you better results : check catboost as an example .

Using chi2 test for feature selection with continuous features (Scikit Learn)

I am trying to predict a binary (categorical) target from many continuous features, and would like to narrow your feature space before heading into model fitting. I noticed that the SelectKBest class from SKLearn's Feature Selection package has the following example on the Iris dataset (which is also predicting a binary target from continuous features):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
X.shape
(150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
(150,2)
The example uses the chi2 test to determine which features should be used in the model. However it is my understanding that the chi2 test is strictly meant to be used in situations where we have categorical features predicting categorical performance. I did not think the chi2 test could be used for scenarios like this. Is my understanding wrong? Can the chi2 test be used to test whether a categorical variable is dependent on a continuous variable?

The SelectKBest function with the chi2 test only works with categorical data. In fact, the result from the test only will have real meaning if the feature only has 1's and 0's.
If you inspect a little bit the implementation of chi2 you going to see that the code only apply a sum across each feature, which means that the function expects just binary values. Also, the parameters that receive the chi2 function indicate the following:
def chi2(X, y):
...
X : {array-like, sparse matrix}, shape = (n_samples, n_features_in)
Sample vectors.
y : array-like, shape = (n_samples,)
Target vector (class labels).
Which means that the function expects to receive the feature vector with all their samples. But later when the expected values are calculated, you will see:
feature_count = X.sum(axis=0).reshape(1, -1)
class_prob = Y.mean(axis=0).reshape(1, -1)
expected = np.dot(class_prob.T, feature_count)
And these lines of code only make sense if the X and Y vector only has 1's and 0's.

I agree with #lalfab however, it's not clear to me why sklearn provides an example of using chi2 on the iris dataset which has all continuous variables. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
>>> X_new.shape
(1797, 20)

My understand of this is that when using Chi2 for feature selection, the dependent variable has to be categorical type, but the independent variables can be either categorical or continuous variables, as long as it's non-negative. What the algorithm trying to do is firstly build a contingency table in a matrix format that reveals the multivariate frequency distribution of the variables. Then try to find the dependence structure underlying the variables using this contingency table. The Chi2 is one way to measure the dependency.
From the Wikipedia on contingency table (https://en.wikipedia.org/wiki/Contingency_table, 2020-07-04):
Standard contents of a contingency table
Multiple columns (historically, they were designed to use up all the white space of a printed page). Where each row refers to a specific sub-group in the population (in this case men or women), the columns are sometimes referred to as banner points or cuts (and the rows are sometimes referred to as stubs).
Significance tests. Typically, either column comparisons, which test for differences between columns and display these results using letters, or, cell comparisons, which use color or arrows to identify a cell in a table that stands out in some way.
Nets or netts which are sub-totals.
One or more of: percentages, row percentages, column percentages, indexes or averages.
Unweighted sample sizes (counts).
Based on this, pure binary features can be easily summed up as counts, which is how people conduct the Chi2 test usually. But as long as the features are non-negative, one can always accumulated it in the contingency table in a "meaningful" way. In the sklearn implementation, it sums up as feature_count = X.sum(axis=0), then later averaged on class_prob.

In my understanding, you cannot use chi-square (chi2) for continuous variables.The chi2 calculation requires to build the contingency table, where you count occurrences of each category of the variables of interest. As the cells in that RC table correspond to particular categories, I cannot see how such table could be built from continuous variables without significant preprocessing.
So, the iris example which you quote, in my view, is an example of incorrect usage.
But there are more problems with the existing implementation of the chi2 feature reduction in Scikit-learn. First, as #lalfab wrote, the implementation requires binary feature, but the documentation is not clear about this. This led to common perception in the community that SelectKBest could be used for categorical features, while in fact it cannot. Second, the Scikit-learn implementation fails to implement the chi2 condition (80% cells of RC table need to have expected count >=5) which leads to incorrect results if some categorical features have many possible values. All in all, in my view this method should not be used neither for continuous, nor for categorical features (except binary). I wrote more about this below:
Here is the Scikit-learn bug request #21455:
and here the article and the alternative implementation:

Passing categorical data to Sklearn Decision Tree

There are several posts about how to encode categorical data to Sklearn Decision trees, but from Sklearn documentation, we got these
Some advantages of decision trees are:
(...)
Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data, with Sklearn, is it possible?

(This is just a reformat of my comment above from 2016...it still holds true.)
The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data - see issue #5442.
The recommended approach of using Label Encoding converts to integers which the DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good - you'll end up with splits that do not make sense.
Using a OneHotEncoder is the only current valid way, allowing arbitrary splits not dependent on the label ordering, but is computationally expensive.

(..)
Able to handle both numerical and categorical data.
This only means that you can use
the DecisionTreeClassifier class for classification problems
the DecisionTreeRegressor class for regression.
In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
one_hot_data = pd.get_dummies(data[['A','B','C']],drop_first=True)
tree.fit(one_hot_data, data['Class'])

For nominal categorical variables, I would not use LabelEncoderbut sklearn.preprocessing.OneHotEncoder or pandas.get_dummies instead because there is usually no order in these type of variables.

As of v0.24.0, scikit supports the use of categorical features in HistGradientBoostingClassifier and HistGradientBoostingRegressor natively!
To enable categorical support, a boolean mask can be passed to the categorical_features parameter, indicating which feature is categorical. In the following, the first feature will be treated as categorical and the second feature as numerical:
>>> gbdt = HistGradientBoostingClassifier(categorical_features=[True, False])
Equivalently, one can pass a list of integers indicating the indices of the categorical features:
>>> gbdt = HistGradientBoostingClassifier(categorical_features=[0])
You still need to encode your strings, otherwise you will get "could not convert string to float" error. See here for an example on using OrdinalEncoder to convert strings to integers.

Yes decision tree is able to handle both numerical and categorical data.
Which holds true for theoretical part, but during implementation, you should try either OrdinalEncoder or one-hot-encoding for the categorical features before training or testing the model. Always remember that ml models don't understand anything other than Numbers.

Sklearn Decision Trees do not handle conversion of categorical strings to numbers. I suggest you find a function in Sklearn (maybe this) that does so or manually write some code like:
def cat2int(column):
vals = list(set(column))
for i, string in enumerate(column):
column[i] = vals.index(string)
return column

you can apply some conversion method like one hot encoding to transform your categorical data into numeric entities and then create the tree
Refer this URL for more information:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

With sklearn classifiers, you can model categorical variables both as an input and as an output.
Let's assume you have categorical predictors and categorical labels (i.e. multi-class classification task). Moreover, you want to handle missing or unknown labels for both predictors and labels.
First thing you need encoder like OrdinalEncoder.
Basic example:
# encoders
from sklearn.preprocessing import OrdinalEncoder
input_enc = OrdinalEncoder(unknown_value=-1, handle_unknown='use_encoded_value', encoded_missing_value=-1)
output_enc = OrdinalEncoder(unknown_value=-1, handle_unknown='use_encoded_value', encoded_missing_value=-1 )
input_enc.fit(df[['Attribute A','Attribute B']].values)
output_enc.fit(df[['Label']].values)
# build classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
X = input_enc.transform(df[['Attribute A','Attribute B']].values)
Y = output_enc.transform(df[['Label']].values)
clf.fit(X, Y)
# predict
predicted = clf.predict(input_enc.transform([('Value 1', 'Value 2')]))
predicted_label = output_enc.inverse_transform([predicted])
If you use df[...].values, your encoder will not store attribute names (column names). This does not matter, as long as same format is used for enc.transform() or enc.inverse_transofrm() (otherwise you will a warning).
OrdinalEncoder by default does not handle nan values and they are not handled by cls.fit(). This is solved by encoded_missing_value param.
In prediction phase, by default encoder will throw an error when ask to transform unknown labels. This is handled by handle_unknown param.

Contrary to the accepted answer, I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.
Scikit-Learn itself provides very good classes to handle categorical data. Instead of writing your custom function, you should use LabelEncoder which is specially designed for this purpose.
Refer to the following code from the documentation:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
This automatically encodes them into numbers for your machine learning algorithms. Now this also supports going back to strings from integers. You can do this by simply calling inverse_transform as follows:
list(le.inverse_transform([2, 2, 1]))
This would return ['tokyo', 'tokyo', 'paris'].
Also note that for many other classifiers, apart from decision trees, such as logistic regression or SVM, you would like to encode your categorical variables using One-Hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.
Hope this helps!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.