Passing categorical data to Sklearn Decision Tree - python

There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says this:
Some advantages of decision trees are:
Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.
But running the following script
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
outputs the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/sklearn/tree/", line 154, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/", line 377, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
I know that in R it is possible to pass categorical data. Is it possible with Sklearn?

The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data - see issue #5442.
The recommended approach of using Label Encoding converts to integers which the DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good - you'll end up with splits that do not make sense.
Using a OneHotEncoder is the only current valid way, allowing arbitrary splits not dependent on the label ordering, but is computationally expensive.
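As a minimal sketch of the one-hot route, assuming the toy frame from the question: the string columns are one-hot encoded inside a ColumnTransformer and the numeric column is passed through. The names `preprocess` and `model` are illustrative only, and this is one possible arrangement rather than the only one.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data from the question.
data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

# One-hot encode the string columns; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])],
    remainder='passthrough',
)
model = Pipeline([
    ('preprocess', preprocess),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
model.fit(data[['A', 'B', 'C']], data['Class'])
predictions = model.predict(data[['A', 'B', 'C']])
```

Keeping the encoder inside the pipeline means the same encoding is applied at fit and predict time, and handle_unknown='ignore' avoids an error if a new category shows up later.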

Able to handle both numerical and categorical data.
This only means that you can use
the DecisionTreeClassifier class for classification problems
the DecisionTreeRegressor class for regression.
In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
one_hot_data = pd.get_dummies(data[['A','B','C']],drop_first=True)
tree.fit(one_hot_data, data['Class'])

For nominal categorical variables, I would not use LabelEncoder but sklearn.preprocessing.OneHotEncoder or pandas.get_dummies instead, because there is usually no order in these types of variables.

As of v0.24.0, scikit-learn supports the use of categorical features in HistGradientBoostingClassifier and HistGradientBoostingRegressor natively!
To enable categorical support, a boolean mask can be passed to the categorical_features parameter, indicating which feature is categorical. In the following, the first feature will be treated as categorical and the second feature as numerical:
>>> gbdt = HistGradientBoostingClassifier(categorical_features=[True, False])
Equivalently, one can pass a list of integers indicating the indices of the categorical features:
>>> gbdt = HistGradientBoostingClassifier(categorical_features=[0])
You still need to encode your strings, otherwise you will get "could not convert string to float" error. See here for an example on using OrdinalEncoder to convert strings to integers.

Yes, a decision tree is able to handle both numerical and categorical data.
That holds in theory, but in practice you should apply either OrdinalEncoder or one-hot encoding to the categorical features before training or testing the model. Always remember that ML models don't understand anything other than numbers.

Sklearn Decision Trees do not handle conversion of categorical strings to numbers. I suggest you find a function in Sklearn (maybe this) that does so or manually write some code like:
def cat2int(column):
    vals = list(set(column))
    for i, string in enumerate(column):
        column[i] = vals.index(string)
    return column

You can apply a conversion method such as one-hot encoding to transform your categorical data into numeric values and then build the tree.

With sklearn classifiers, you can model categorical variables both as an input and as an output.
Let's assume you have categorical predictors and categorical labels (i.e. multi-class classification task). Moreover, you want to handle missing or unknown labels for both predictors and labels.
First, you need an encoder such as OrdinalEncoder.
Basic example:
# encoders
from sklearn.preprocessing import OrdinalEncoder
input_enc = OrdinalEncoder(unknown_value=-1, handle_unknown='use_encoded_value', encoded_missing_value=-1)
output_enc = OrdinalEncoder(unknown_value=-1, handle_unknown='use_encoded_value', encoded_missing_value=-1)
input_enc.fit(df[['Attribute A','Attribute B']].values)
output_enc.fit(df[['Label']].values)
# build classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
X = input_enc.transform(df[['Attribute A','Attribute B']].values)
Y = output_enc.transform(df[['Label']].values)
clf.fit(X, Y)
# predict
predicted = clf.predict(input_enc.transform([('Value 1', 'Value 2')]))
predicted_label = output_enc.inverse_transform([predicted])
If you use df[...].values, your encoder will not store attribute names (column names). This does not matter, as long as the same format is used for enc.transform() or enc.inverse_transform() (otherwise you will get a warning).
OrdinalEncoder by default does not handle nan values, and they are not handled by the classifier either. This is solved by the encoded_missing_value param.
In the prediction phase, the encoder will by default throw an error when asked to transform unknown labels. This is handled by the handle_unknown param.
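A short sketch of the unknown-label behaviour, using invented category values and assuming scikit-learn 0.24 or later (where handle_unknown='use_encoded_value' is available):

```python
from sklearn.preprocessing import OrdinalEncoder

# Categories seen during fit are numbered in sorted order: high=0, low=1, medium=2.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit([['low'], ['medium'], ['high']])

# A value that was not seen during fit is mapped to unknown_value instead of raising.
codes = enc.transform([['medium'], ['never seen']])
```

With the default handle_unknown='error', the second row would raise a ValueError at transform time.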

Contrary to the accepted answer, I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.
Scikit-Learn itself provides very good classes to handle categorical data. Instead of writing your custom function, you should use LabelEncoder which is specially designed for this purpose.
Refer to the following code from the documentation:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
This automatically encodes them into numbers for your machine learning algorithms. Now this also supports going back to strings from integers. You can do this by simply calling inverse_transform as follows:
list(le.inverse_transform([2, 2, 1]))
This would return ['tokyo', 'tokyo', 'paris'].
Also note that for many other classifiers, apart from decision trees, such as logistic regression or SVM, you would like to encode your categorical variables using One-Hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.

Combining sklearn pipeline and cross validation with binary columns

I want to run a regression model on a dataset with one textual column, five binary variables, and one numerical target variable. I included a CountVectorizer to vectorize the textual column, and tried to combine it in a sklearn Pipeline using make_column_transformer. The data doesn't have any missing values - yet, when running the below script, I am getting the following warning:
FitFailedWarning: Estimator fit failed. The score on this train-test
partition for these parameters will be set to nan.
and following error message:
TypeError: All estimators should implement fit and transform, or can be
'drop' or 'passthrough' specifiers. 'Level1' (type <class 'str'>) doesn't.
I assume the problem might be that I did not specify a second tuple in make_column_transformer but merely the following: sample_df[categorical_cols]. However, I am unsure how to include already processed, ready data in make_column_transformer.
Full code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']
pipeline = Pipeline([
    ('transformer', make_column_transformer((CountVectorizer(), textual_col),
                                            sample_df[categorical_cols])),
    ('model', RandomForestRegressor())
])
X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
Sample dataset:
import io
import pandas as pd
data_string = """
Level1;Level2;Level3;Level4;Level5;Text;Value
0;0;1;0;0;Are you sure that the input;109.3
0;0;0;0;0;that the input text data for;87.2
0;0;1;0;0;text data for your model is;21.5
0;0;0;0;0;your model is in English? Well,;143.5
0;0;0;0;1;in English? Well, no one can;141.1
0;0;0;0;0;no one can be sure about;93.4
0;0;0;0;0;be sure about this, as no;29.5
0;0;0;0;0;this, as no one will read;17.9
0;0;1;0;0;one will read around 20k records;37.8
0;0;1;0;0;around 20k records of text data.;153.7
0;0;0;0;0;of text data. So, how non-English;99.5
0;0;0;1;0;So, how non-English text will affect;119.1
0;0;0;0;1;text will affect your English text;97.5
0;0;0;0;0;your English text trained model? Pick;49.2
0;0;0;0;0;trained model? Pick any non-English text;79.3
0;0;0;0;0;any non-English text and pass it;107.7
0;1;0;0;1;and pass it through as input;117.3
0;0;0;0;0;through as input to your English;151.1
0;0;0;0;0;to your English text trained classification;47.3
0;0;0;0;0;text trained classification model. You will;129.3
0;0;0;0;0;model. You will come to know;135.1
0;0;0;0;0;come to know that the category;145.8
0;0;0;0;1;that the category is assigned to;131.9
1;0;0;1;0;is assigned to non-English text by;43.7
1;0;0;0;0;non-English text by the model. If;67.1
1;0;0;0;0;the model. If your model is;105.3
0;0;0;1;0;your model is dependent on one;65.2
0;1;0;0;0;dependent on one language then, other;98.3
0;0;0;0;0;language then, other languages in your;130.5
0;0;0;0;0;languages in your textual data should;107.2
0;1;1;0;0;textual data should be considered as;66.5
0;0;0;1;0;be considered as noise. But why?;43.1
0;0;0;0;1;noise. But why? The job of;56.7
0;0;0;0;0;The job of the text classification;75.1
1;0;0;0;0;the text classification model is to;88.3
1;0;0;0;0;model is to classify. And, it;91.3
0;0;0;0;0;classify. And, it will do its;106.4
1;0;0;0;0;will do its job despite its;109.5
0;0;0;0;1;job despite its input text will;143.1
0;0;0;0;0;input text will be in English;54.1
1;0;0;0;0;be in English or not. What;96.4
0;0;0;1;0;or not. What can we do;133.8
0;0;0;0;0;can we do to avoid such;146.4
0;0;1;0;0;to avoid such a situation? Your;164.3
0;0;1;0;0;a situation? Your model will not;34.6
0;0;0;0;0;model will not stop classifying the;76.8
0;0;0;1;0;stop classifying the non-English text. So,;80.5
0;0;1;0;0;non-English text. So, you have to;90.3
0;0;0;0;0;you have to detect the non-English;68.3
0;0;0;0;0;detect the non-English text and remove;44.0
0;0;1;0;0;text and remove it from trained;100.4
0;0;0;0;0;it from trained data and prediction;117.4
0;0;0;0;1;data and prediction data. This process;85.4
0;1;0;0;0;data. This process comes under the;65.7
0;0;1;0;0;comes under the data cleaning part.;54.3
0;1;0;0;0;data cleaning part. Inconsistency in your;78.9
0;0;0;0;0;Inconsistency in your data will result;96.8
1;0;0;0;1;data will result in a decrease;108.1
0;0;0;0;0;in a decrease in the accuracy;145.7
1;0;0;0;0;in the accuracy of the model.;103.6
0;0;1;0;0;of the model. Sometimes, multiple languages;56.4
0;0;0;0;1;Sometimes, multiple languages present in text;90.5
0;0;0;0;0;present in text data could be;80.4
0;0;0;0;0;data could be one of the;90.7
1;0;0;0;0;one of the reasons your model;48.8
0;0;0;0;0;reasons your model behaves strangely. So,;65.4
0;0;1;0;0;behaves strangely. So, in this article,;107.5
0;0;0;0;0;in this article, we will discuss;143.2
0;0;0;0;0;we will discuss the different python;165.0
0;0;0;0;0;the different python libraries which detect;123.3
0;0;0;0;1;libraries which detect the language(s) of;85.3
0;0;0;0;0;the language(s) of the text data.;91.4
0;0;0;0;1;the text data. Let’s start with;49.5
0;0;0;0;0;Let’s start with the spaCy library.;76.3
0;0;0;0;0;the spaCy library.;49.5
"""
sample_df = pd.read_csv(io.StringIO(data_string), sep=';')

You can use remainder='passthrough' to avoid transforming already processed columns: in your case, the binary columns become remainder columns that the ColumnTransformer does not process but passes through unchanged. You should also be aware that CountVectorizer expects a 1D array as input, so the column passed to make_column_transformer must be specified as a string ('Text') rather than as a list (['Text']); see the make_column_transformer() documentation:
columns : str, array-like of str, int, array-like of int, slice, array-like of bool or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
categorical_cols = [col for col in sample_df.columns if col.startswith('Level')]
textual_col = ['Text']
pipeline = Pipeline([
    ('transformer', make_column_transformer((CountVectorizer(), 'Text'),
                                            remainder='passthrough')),
    ('model', RandomForestRegressor())
])
X = sample_df[textual_col + categorical_cols]
y = sample_df['Value']
cv = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, X, y, cv=cv)
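To see what the transformer alone produces, here is a reduced sketch on a three-row frame. `toy_df` and its contents are made up for illustration; only the 'Text' and 'Level…' column naming follows the question.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

# Made-up miniature version of the question's data.
toy_df = pd.DataFrame({
    'Text': ['red apple', 'green apple', 'red car'],
    'Level1': [0, 1, 0],
    'Level2': [1, 0, 0],
})

# 'Text' is given as a scalar string so CountVectorizer receives a 1D column;
# the binary columns are appended untouched by remainder='passthrough'.
transformer = make_column_transformer(
    (CountVectorizer(), 'Text'),
    remainder='passthrough',
)
features = transformer.fit_transform(toy_df)
n_rows, n_cols = features.shape
```

The vocabulary here has four tokens (apple, car, green, red), so the result has four count columns followed by the two binary columns.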

OneHotEncoding after LabelEncoding

How can I do OneHotEncoding after LabelEncoding in Sklearn?
What I have done so far is map all the string features of my dataset like this:
# Categorical boolean mask
categorical_feature_mask = X.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = X.columns[categorical_feature_mask].tolist()
After that I applied this to the dataset columns, with indexing like this:
X[categorical_cols] = X[categorical_cols].apply(lambda col: le.fit_transform(col))
My results were not very good, so I want to use OneHotEncoding to see if performance improves.
This is my code:
ohe = OneHotEncoder(categorical_features = categorical_cols)
X[categorical_cols] = ohe.fit_transform(df).toarray()
I have tried different approaches, but what I am trying to accomplish here is to use the OneHotEncoding technique to overwrite the features.

OneHotEncoder directly supports categorical features, so no need to use a LabelEncoder prior to using it. Also note, that you should not use a LabelEncoder to encode features. Check LabelEncoder for features? for a detailed explanation on this. A LabelEncoder only makes sense on the actual target here.
So select the categorical columns (df.select_dtypes is normally used here), and fit on the specified columns. Here's a sketch of how you could proceed:
# OneHot encoding categorical columns
oh_cols = df.select_dtypes('object').columns
X_cat = df[oh_cols].to_numpy()
oh = OneHotEncoder()
one_hot_cols = oh.fit(X_cat)
Then just call the transform method of the encoder. If you wanted to reconstruct the dataframe (as your code suggests), get_feature_names_out (get_feature_names in scikit-learn versions before 1.0) will give you the category names of the categorical features:
df_prepr = pd.DataFrame(one_hot_cols.transform(X_cat).toarray(),
                        columns=one_hot_cols.get_feature_names_out(oh_cols))

SVC (support vector classification) with categorical (string) data as labels

I use scikit-learn to implement a simple supervised learning algorithm. In essence I follow the tutorial here (but with my own data).
I try to fit the model:
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(data_training, labels_training)  # data_training is a placeholder; the original name of the feature array is missing
But at the second line, I get an error: ValueError: could not convert string to float: 'A'
The error is expected because labels_training contains string values which represent three different categories, such as A, B, C.
So the question is: How do I use SVC (support vector classification), if the labelled data represents categories in form of strings. One intuitive solution to me seems to simply convert each string to a number. For instance, A = 0, B = 1, etc. But is this really the best solution?
Take a look at section 4.3.4 Encoding categorical features.
In particular, look at using the OneHotEncoder. This will convert categorical values into a format that can be used by SVMs.

You can try this code:
from sklearn import svm
X = [[0, 0], [1, 1],[2,3]]
y = ['A', 'B','C']
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X, y)
You should pass the dependent variable (y) as a list.

Please do not feed integer-encoded categorical features to SVC. This algorithm works only with continuous variables. "Such integer representation can, however, not be used directly with all scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily)"
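The two points above can be combined in one sketch: string class labels go to SVC as they are, while categorical features are one-hot encoded first. The browser values and the A/B/C labels below are made up, and the linear kernel is just a choice for this toy case.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Made-up data: one categorical feature and string class labels.
X = [['chrome'], ['firefox'], ['safari'], ['chrome'], ['firefox'], ['safari']]
y = ['A', 'B', 'C', 'A', 'B', 'C']

# One-hot encode the feature so no artificial ordering is introduced;
# the string labels need no manual conversion.
clf = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    SVC(kernel='linear', C=100.),
)
clf.fit(X, y)
predicted = clf.predict([['firefox'], ['chrome']])
```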

How to apply pyspark-mllib-kmeans to categorical variables

There is a huge data file consisting of all categorical columns. I need to dummy code the data before applying kmeans in mllib. How is this doable in pySpark?
Thank you
Well, technically it is possible. Spark, including PySpark, provides a number of transformers which can be used to encode categorical data. In particular you should take a look at ml.feature.StringIndexer and OneHotEncoder.
from pyspark.ml.feature import OneHotEncoder, StringIndexer
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["label", "feature"])
stringIndexer = StringIndexer(inputCol="feature", outputCol="indexed")
model = stringIndexer.fit(df)
indexed = model.transform(df)
encoder = OneHotEncoder(inputCol="indexed", outputCol="encoded")
# In Spark 3.0+, OneHotEncoder is an estimator, so use encoder.fit(indexed).transform(indexed) instead
encoded = encoder.transform(indexed)
So far so good. Problem is that categorical variables are not very useful in case of k-means. It assumes Euclidean norm which, even after encoding, is rather meaningless for categorical data.
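A quick way to see why the Euclidean norm carries little information here is to compute the distances between one-hot vectors directly. This is a plain NumPy sketch, not Spark code, and the three categories are hypothetical:

```python
import numpy as np

# One-hot vectors for three hypothetical categories, e.g. "foo", "bar", "baz".
one_hot = np.eye(3)

# Pairwise Euclidean distances between the encoded categories.
diffs = one_hot[:, None, :] - one_hot[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Every pair of distinct categories ends up at distance sqrt(2).
off_diagonal = distances[~np.eye(3, dtype=bool)]
```

Since every pair of distinct categories is equally far apart, the distance only says whether two values are the same or different, and cluster centroids become fractional mixtures of indicators that correspond to no actual category.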

How to apply a binary classifier in Scikit learn when attributes are string (not int or float)

I have a list of first and last name of people with a binary language class (speak English or not).
Here is a sample file (I changed the names with dummy values to keep the privacy of people):
I wanted to apply machine learning algorithms such as SVM and Naive Bayes using Scikit learn to evaluate a binary classification task. Since scikit does not allow attributes to be strings, I transformed them to integers. The transformed sample file is like this:
I wanted to ask whether SVM and Naive Bayes treat the input values of first and last names as independent values, or whether there is some relation between the numbers. In other words, does it matter that 5 is greater than 2, or are the numbers just considered unique values regardless of their arithmetic value?
The reason for this question is that if I order the list by language(i.e. English speakers first) and then replace the names with integers, the algorithm gives me very good results(accuracy and f score above 97%). But if I shuffle the list and then replace names with integers, the results will be poor.
In general, what is the solution to do a classification using Scikit, when attribute values are strings.
P.S.1: I tested the same dataset with Weka and I didn't have such a problem because Weka uses arff files and it does necessary conversions itself.
P.S.2: Here is the code that I am using to read the file and apply the algorithm (works fine with no error)
# read the file into numpy arrays
path = "/path/to/csv/file/BinaryClassification.csv"
import numpy as np
lstAttributes = np.loadtxt(path, delimiter=',')[:,0:2]
lstLabels = np.loadtxt(path, delimiter=',')[:,2:3]
# flatten the 2D label column into a 1D array (loop body reconstructed; the original line is missing)
tempArr = []
for v in lstLabels:
    tempArr.append(v[0])
from numpy import array
lstLabels = array(tempArr)
# train and test the algorithm (uses the whole data as training and test set)
from sklearn import naive_bayes
classifier = naive_bayes.GaussianNB()
model = classifier.fit(lstAttributes, lstLabels)
prediction = model.predict(lstAttributes)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(lstLabels, prediction))
# use 5-fold cross validation to evaluate the algorithm
# (sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier, lstAttributes, lstLabels, cv=5, scoring='f1')
print("cross validation: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
In general, you have to know what the strings mean in order to convert them to numeric feature values, and you also have to consider which learning algorithm the result goes into. In this case, a one-hot encoding is probably the best thing to try first. DictVectorizer implements that. The result will be a sparse matrix of indicator variables, so you'd better switch from GaussianNB to BernoulliNB (not that GaussianNB makes sense for your current encoding).
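A rough sketch of that suggestion, with invented names and labels (the dict keys `first` and `last` are an assumption about how the two name columns would be represented):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# Invented example records; each person is a dict of string-valued attributes.
people = [
    {'first': 'John', 'last': 'Smith'},
    {'first': 'Mary', 'last': 'Smith'},
    {'first': 'Akira', 'last': 'Tanaka'},
    {'first': 'Yuki', 'last': 'Tanaka'},
]
speaks_english = [1, 1, 0, 0]

# DictVectorizer turns each distinct (key, string value) pair into its own
# indicator column, i.e. a one-hot encoding stored as a sparse matrix.
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(people)

# BernoulliNB models binary indicator features.
classifier = BernoulliNB().fit(X, speaks_english)

# Values unseen during fit (here the first name 'Anna') are silently ignored.
prediction = classifier.predict(
    vectorizer.transform([{'first': 'Anna', 'last': 'Smith'}])
)
```

Because every name gets its own indicator column, the result no longer depends on the order in which names were numbered, which is what caused the sorted-versus-shuffled difference described in the question.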

