Say I have a categorical feature, color, which takes the values
['red', 'blue', 'green', 'orange'],
and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell sklearn that the four dummy variables are really one variable? Specifically, when sklearn is randomly selecting features to use at different nodes, it should either include the red, blue, green and orange dummies together, or it shouldn't include any of them.
I've heard that there's no way to do this, but I'd imagine there must be a way to deal with categorical variables without arbitrarily coding them as numbers or something like that.
No, there isn't. Somebody's working on this and the patch might be merged into mainline some day, but right now there's no support for categorical variables in scikit-learn except dummy (one-hot) encoding.
Most implementations of random forest (and many other machine learning algorithms) that accept categorical inputs are either just automating the encoding of categorical features for you or using a method that becomes computationally intractable for large numbers of categories.
A notable exception is H2O. H2O has a very efficient method for handling categorical data directly, which often gives it an edge over tree-based methods that require one-hot encoding.
This article by Will McGinnis has a very good discussion of one-hot-encoding and alternatives.
This article by Nick Dingwall and Chris Potts has a very good discussion about categorical variables and tree based learners.
You have to make the categorical variable into a series of dummy variables. Yes, I know it's annoying and seems unnecessary, but that is how sklearn works.
If you are using pandas, use pd.get_dummies; it works really well.
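A minimal sketch of that, using the color column from the question:

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "orange"]})

# one 0/1 column per category
dummies = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df.drop(columns="color"), dummies], axis=1)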
Maybe you can use the numbers 1 to 4 to replace the four colors, i.e. store the number rather than the color name in that column. The numeric column can then be used in the models.
No.
There are two types of categorical features:
Ordinal: use OrdinalEncoder
Nominal: use LabelEncoder or OneHotEncoder
Note the difference between LabelEncoder and OneHotEncoder:
Label: only for one column => usually used to encode the label column (i.e., the target column)
One-hot: for multiple columns => can handle several features at one time (see the sketch below)
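A minimal sketch of both encoders, assuming a small made-up DataFrame (the column names and categories are for illustration only):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({"size": ["small", "large", "medium"],   # ordinal
                   "color": ["red", "blue", "green"]})     # nominal

# OrdinalEncoder: integer codes that respect the order you pass in
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_enc"] = ord_enc.fit_transform(df[["size"]]).ravel()

# OneHotEncoder: one binary column per category
ohe = OneHotEncoder()
color_ohe = ohe.fit_transform(df[["color"]]).toarray()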
You can directly feed categorical variables to a random forest using the approach below:
First, convert the categories of the feature to numbers using sklearn's LabelEncoder.
Second, convert the label-encoded feature's type to string (object):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# encode the categories as integers, then cast the column to string (object)
df[col] = le.fit_transform(df[col]).astype('str')
The above code will solve your problem.
I am working with a dataset consisting of 190 columns and more than 3 million rows.
Unfortunately, all of the data is stored as string values.
Is there any way of building a model with this kind of data, other than tokenising it?
Thank you and regards!
This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.
Firstly, I don't think there is any straightforward answer, as most ML models depend on the data that you have. If your string data is simply Yes or No (binary), it could easily be dealt with by replacing Yes = 1 and No = 0, but that doesn't work for something like country.
One-Hot Encoding - For features like country, it would be fairly simple to one-hot encode it and train the model on the resulting numerical data. But with the number of columns that you have, and depending on the number of unique values in such a large amount of data, the dimensionality will increase by a lot.
Assigning numeric values - We also cannot simply assign numeric values to the strings and expect our model to work, as there is a very high chance that the model will pick up on a numeric order that does not exist in the first place. more info
Bag of words, Word2Vec - Since you excluded tokenization, I don't know whether you want to do this, but these are also options.
Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R, so I cannot confirm how far this is true. Nevertheless, you can find more about it here.
One-Hot + VectorAssembler - I only came up with this in theory. If you manage to somehow convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be combined into a single vector using VectorAssembler in PySpark (Spark). More info about VectorAssembler is here.
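A minimal sketch of that last idea, assuming a recent PySpark (3.x); the column names and values are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("FR", 12.0), ("DE", 7.5), ("FR", 3.2)],
                           ["country", "amount"])  # made-up columns

# index the string column, one-hot encode it, then pack everything into one feature vector
indexer = StringIndexer(inputCol="country", outputCol="country_idx")
encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_ohe"])
assembler = VectorAssembler(inputCols=["country_ohe", "amount"], outputCol="features")

features = Pipeline(stages=[indexer, encoder, assembler]).fit(df).transform(df)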
I have a database with 13 columns (both categorical and numerical). The 13th column is a categorical variable, SalStat, which classifies whether the person is below 50k or above 50k. I am using Logistic Regression for this case and want to know which columns (numerical and categorical) are redundant, that is, don't affect SalStat, so that I can remove them. What function should I use for this purpose?
In my opinion, you can study the correlation between your variables and remove the ones that have high correlation, since they in a way give the same amount of information to your model.
You can start with something like DataFrame.corr(), then draw a heatmap using seaborn for better visualization with seaborn.heatmap(), or a simpler one with plt.imshow(data.corr()) and plt.colorbar().
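A minimal sketch of that, using made-up numeric data in place of your (already encoded) columns:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# stand-in for your encoded dataset
data = pd.DataFrame(np.random.rand(100, 4), columns=["age", "hours", "capital_gain", "SalStat"])

# correlation matrix and its heatmap
corr = data.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# or the plain matplotlib version
plt.imshow(corr)
plt.colorbar()
plt.show()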
I work in Python. I have a problem with the categorical variable - "city".
I'm building a predictive model on a large dataset, over 1 million rows.
I have over 100 features. One of them is "city", consisting of 33,000 different cities.
I use e.g. XGBoost, where I need to convert categorical variables into numeric ones. Dummifying causes the number of features to increase strongly. XGBoost (and my 20 GB of RAM) can't handle this.
Is there any other way to deal with this variable than e.g. one-hot encoding, dummies etc.?
(When using one-hot encoding, for example, I have performance problems; there are too many features in my model and I'm running out of memory.)
Is there any way to deal with this?
XGBoost has also, since version 1.3.0, added experimental support for categorical features.
Copying my answer from another question.
Nov 23, 2020
XGBoost has since version 1.3.0 added experimental support for categorical features. From the docs:
1.8.7 Categorical Data
Other than users performing encoding, XGBoost has experimental support for categorical data using gpu_hist and gpu_predictor. No special operation needs to be done on input test data since the information about categories is encoded into the model during training.
https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf
In the DMatrix section the docs also say:
enable_categorical (boolean, optional) – New in version 1.3.0.
Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it's only available for gpu_hist tree method with 1 vs rest (one hot) categorical split. Also, JSON serialization format, gpu_predictor and pandas input are required.
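A minimal sketch of what using that experimental flag might look like (made-up data; at the time of that answer it required pandas 'category' dtype input and the gpu_hist tree method):

import pandas as pd
import xgboost as xgb

X = pd.DataFrame({"city": pd.Categorical(["paris", "lyon", "paris", "nice"]),
                  "amount": [1.0, 2.5, 0.3, 4.1]})
y = [0, 1, 0, 1]

# enable_categorical is experimental; categories are encoded into the model during training
dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)
booster = xgb.train({"tree_method": "gpu_hist"}, dtrain, num_boost_round=10)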
Other model options:
If you don't need to use XGBoost, you can use a model like LightGBM or CatBoost, which support categorical features without one-hot encoding out of the box.
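A minimal sketch of both, with made-up columns (LightGBM picks up the pandas 'category' dtype natively; CatBoost takes the categorical column names via cat_features):

import pandas as pd
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X = pd.DataFrame({"city": ["paris", "lyon", "nice", "paris"],
                  "amount": [1.0, 2.5, 0.3, 4.1]})
y = [0, 1, 0, 1]

# LightGBM: cast to 'category' and it is handled without one-hot encoding
X_lgb = X.copy()
X_lgb["city"] = X_lgb["city"].astype("category")
LGBMClassifier().fit(X_lgb, y)

# CatBoost: pass the categorical columns explicitly
CatBoostClassifier(verbose=0).fit(X, y, cat_features=["city"])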
You could use some kind of embedding that reflects those cities better (and compresses the total number of features compared to direct OHE), maybe using some features to describe the continent each city belongs to, then some other features to describe the country/region, etc.
Note that since you didn't provide any specific detail about this task, I've used only geographical data in my example, but you could use other variables related to each city, like the mean temperature, the population, the area, etc., depending on the task you are trying to address here.
Another approach could be replacing the city name with its coordinates (latitude and longitude). Again, this may be helpful depending on the task for your model.
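A minimal sketch of the coordinates idea; the lookup table here is made up, and in practice you would build it from a geocoding source:

import pandas as pd

df = pd.DataFrame({"city": ["paris", "lyon", "nice"], "amount": [1.0, 2.5, 0.3]})

# hypothetical city -> latitude/longitude lookup
coords = pd.DataFrame({"city": ["paris", "lyon", "nice"],
                       "lat": [48.86, 45.76, 43.70],
                       "lon": [2.35, 4.84, 7.27]})

# replace the high-cardinality city column with two numeric columns
df = df.merge(coords, on="city", how="left").drop(columns="city")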
Hope this helps
Besides the models, you could also decrease the number of features (cities) by grouping them into geographical regions. Another option is grouping them by population size.
Yet another option is grouping them by their frequency, using quantile bins (see the sketch below). Target encoding might be another option for you.
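A minimal sketch of the frequency-binning idea, with made-up data:

import pandas as pd

df = pd.DataFrame({"city": ["paris", "lyon", "nice", "paris", "paris", "lyon"]})

# frequency of each city, then bucket the frequencies into quantile bins
freq = df["city"].map(df["city"].value_counts())
df["city_freq_bin"] = pd.qcut(freq, q=2, labels=False, duplicates="drop")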
Feature engineering in many cases involves a lot of manual work; unfortunately, you cannot always have everything sorted out automatically.
There are already great responses here.
Another technique I would use is clustering those cities into groups with K-means, based on some of the city-specific features in your dataset.
That way you could use the cluster number in place of the actual city, which could reduce the number of levels quite a bit.
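A minimal sketch, assuming you have (or can build) a small table of numeric per-city features; the features and values below are made up:

import pandas as pd
from sklearn.cluster import KMeans

# hypothetical per-city features
city_features = pd.DataFrame({"city": ["paris", "lyon", "nice"],
                              "population": [2_100_000, 515_000, 340_000],
                              "mean_income": [32_000, 28_000, 27_000]})

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
city_features["city_cluster"] = kmeans.fit_predict(city_features[["population", "mean_income"]])
# then merge city_cluster back into the main dataset and drop the raw city column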
Is it possible to use a categorical variable, as-is, in Python/scikit-learn GLM models? I do realize the alternative of one-hot encoding. My issue with that approach is that I will be unable to test the entire variable for significance; I can only test each encoded variable (which is partial).
Why is it that SAS can handle such a variable and not Python? Please advise.
It actually depends on the data that you have. For instance, if you can assign some sort of order to the categorical variable (ordinal values) like low, medium and high, you can assign them numbers like 1, 2 and 3. However, it gets a little trickier if there is no order whatsoever. Besides one-hot encoding, you can try the Helmert coding scheme. You can also read this blog post for more analysis. There are also various other coding schemes for categorical variables in the scikit-learn-compatible category_encoders package:
Sum Coding
Polynomial Coding
Backward Difference Coding
You can read more about the other categorical encoders here; a short sketch is below.
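A minimal sketch using the category_encoders package (the column name and data are made up; install with pip install category_encoders):

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
y = [0, 1, 0, 1]

# Helmert contrast coding (one of several contrast schemes)
X_helmert = ce.HelmertEncoder(cols=["color"]).fit_transform(df, y)

# the other schemes follow the same fit/transform API
X_backdiff = ce.BackwardDifferenceEncoder(cols=["color"]).fit_transform(df, y)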
First, thanks for reading me, and thanks a lot if you can give any clue to help me solve this.
As I'm new to Scikit-learn, don't hesitate to provide any advice that can help me to improve the process and make it more professional.
My goal is to classify data between two categories. I would like to find a solution that would give me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.
In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example of a line:
"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97
I have around 900K lines for training and I run my tests over 100K lines.
As I want to compare several algorithm implementations, I wanted to encode all the nominal values so they can be used in several classifiers.
I tried several things:
LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
OneHotEncoder: if I understand well, it is almost perfect for my needs because I can select the columns to binarize. But as I have a lot of nominal values, it always runs into a MemoryError. Moreover, its input must be numerical, so it is compulsory to LabelEncode everything beforehand.
StandardScaler: this is quite useful, but not for what I need. I decided to integrate it to scale my continuous values.
FeatureHasher: at first I didn't understand what it does. Then I saw that it was mainly used for text analysis. I tried to use it for my problem; I cheated by creating a new array containing the result of the transformation. I think it was not built to work that way and it was not even logical.
DictVectorizer: could be useful, but it looks like OneHotEncoder and puts even more data in memory.
partial_fit: this method is provided by only 5 classifiers. I would like to be able to use it with Perceptron, KNearest and RandomForest at least, so it doesn't match my needs.
I looked at the documentation and found this information on the Preprocessing and Feature Extraction pages.
I would like to have a way to encode all the nominal values so that they will not be considered as ordered, a solution that can be applied to large datasets with a lot of categories and weak resources.
Is there any way I didn't explore that can fit my needs?
Thanks for any clue and piece of advice.
To convert unordered categorical features you can try get_dummies in pandas; more details can be found in its documentation. Another way is to use CatBoost, which can directly handle categorical features without transforming them into numerical type.
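A minimal sketch of the CatBoost route, with made-up column names standing in for the nominal ones in your example line:

import pandas as pd
from catboost import CatBoostClassifier, Pool

X = pd.DataFrame({"brand": ["RENAULT", "PEUGEOT", "RENAULT"],
                  "fuel": ["Diesel", "Essence", "Diesel"],
                  "price": [668.77, 540.10, 247.97]})
y = [0, 1, 0]

# CatBoost consumes the string columns directly; just declare them as cat_features
train_pool = Pool(X, y, cat_features=["brand", "fuel"])
model = CatBoostClassifier(iterations=100, verbose=0)
model.fit(train_pool)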