I have a Categorical column with 'State Names'. I'm unsure about which type of Categorical Encoding I'll have to perform in order to convert them to Numeric type.
There are 83 unique State Names.
A label encoder is meant for ordinal categorical variables, while one-hot encoding would greatly increase the number of columns since there are 83 unique state names.
Is there anything else I can try?
I would use scikit-learn's OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) or CategoricalEncoder with encoding set to 'onehot'. It automatically finds the unique values for each feature and turns each one into a one-hot vector. It does increase the input dimensionality for that feature, but that is usually the safer choice. If you convert the feature to a single ordinal integer instead of a vector of binary values, an algorithm may draw incorrect conclusions about two (possibly completely unrelated) categorical values that just happen to be close together in the categorical space.
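As a minimal sketch (the 'state' column name is made up for illustration), this is roughly how OneHotEncoder is used; the output is a sparse matrix by default, which keeps memory usage manageable:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical data with an 83-level 'state' column (only a few values shown)
df = pd.DataFrame({"state": ["NY", "CA", "TX", "NY"]})

enc = OneHotEncoder(handle_unknown="ignore")   # sparse output by default
X = enc.fit_transform(df[["state"]])           # one binary column per unique state
print(enc.categories_)                         # the learned state names
```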
There are other powerful encoding schemes besides one-hot that do not increase the number of columns. You can try the following (in increasing order of complexity):
count encoding: encode each category by the number of times it occurs in the data. This is useful in some cases: for example, if you want to encode the information that New York is a big city, the count of NY in the data really does carry that information, since we expect NY to occur frequently.
target encoding: encode each category by the average value of the target/outcome within that category if the target is continuous, or by the probability of the target if it is discrete. An example is encoding neighborhood, which is obviously important for predicting house price: you can replace each neighborhood name by the average house price in that neighborhood. This can improve prediction considerably (as shown in my Kaggle notebook for predicting house price).
There are still other useful encoding schemes like CatBoost encoding, weight of evidence, etc. A really nice thing is that all these schemes are already implemented in the category_encoders library; a sketch is shown below.
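As a rough, hedged sketch of the count and target encoders mentioned above, assuming a recent version of the category_encoders package (column names and data are made up):

```python
import pandas as pd
import category_encoders as ce

# hypothetical data: a categorical 'state' column and a continuous target
df = pd.DataFrame({"state": ["NY", "NY", "CA", "TX", "NY"]})
y = [500, 450, 300, 200, 480]          # e.g. house prices

# count encoding: each state is replaced by how often it occurs
df["state_count"] = ce.CountEncoder(cols=["state"]).fit_transform(df[["state"]])["state"]

# target encoding: each state is replaced by the (smoothed) mean target within it
df["state_target"] = ce.TargetEncoder(cols=["state"]).fit_transform(df[["state"]], y)["state"]
```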
I am working with a dataset consisting of 190 columns and more than 3 million rows.
Unfortunately, all of the data is stored as string values.
Is there any way of building a model with this kind of data, other than tokenising it?
Thank you and regards!
This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.
Firstly, I don't think there is any straightforward answer, as the best approach depends on the data that you have. If your string data is simply Yes or No (binary), it can easily be dealt with by replacing Yes = 1 and No = 0, but that doesn't work for something like country.
One-Hot Encoding - For features like country, it would be fairly simple to one-hot encode them and start training the model on the numerical data thus obtained. But with the number of columns that you have, and given the unique values in such a large amount of data, the dimensionality will increase by a lot.
Assigning numeric values - We also cannot simply assign numeric values to the strings and expect our model to work, as there is a very high chance that the model will pick up on a numeric order that does not exist in the first place. more info
Bag of words, Word2Vec - Since you excluded tokenization, I don't know whether you want to do this, but these options also exist.
Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R, so I cannot confirm how far this is true. Nevertheless, you can find more about it here
One-Hot + VectorAssembler - I only came up with this in theory. If you manage to somehow convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be combined into a single vector using VectorAssembler in PySpark (Spark). More info about VectorAssembler is here
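A minimal sketch of that last idea, assuming Spark 3.x and made-up column names (StringIndexer and OneHotEncoder for the string columns, then VectorAssembler to build a single feature vector):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("US", "yes", 3.0), ("DE", "no", 1.5), ("US", "no", 2.0)],
    ["country", "flag", "amount"],
)

# index the string columns, one-hot encode the indices, then assemble one vector
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in ["country", "flag"]]
encoder = OneHotEncoder(inputCols=["country_idx", "flag_idx"],
                        outputCols=["country_vec", "flag_vec"])
assembler = VectorAssembler(inputCols=["country_vec", "flag_vec", "amount"],
                            outputCol="features")

pipeline = Pipeline(stages=indexers + [encoder, assembler])
pipeline.fit(df).transform(df).select("features").show(truncate=False)
```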
I have received a dataset with postcodes and car brands, and one-hot encoding would make the number of variables far too large.
I have seen examples where the high-cardinality column is transformed into the number of times each value occurs in that column, but in the same article they said this technique was not advisable for postcodes. Is this true?
I work in Python. I have a problem with the categorical variable - "city".
I'm building a predictive model on a large dataset: over 1 million rows.
I have over 100 features. One of them is "city", consisting of 33,000 different cities.
I use e.g. XGBoost, where I need to convert categorical variables into numeric ones. Dummifying causes the number of features to increase strongly, and XGBoost (and my 20 GB of RAM) can't handle this.
Is there any other way to deal with this variable than e.g. One Hot Encoding, dummies etc.?
(When using One-Hot Encoding, for example, I have performance problems: there are too many features in my model and I run out of memory.)
Is there any way to deal with this?
Since version 1.3.0, XGBoost has also added experimental support for categorical features.
Copying my answer from another question (Nov 23, 2020).
XGBoost has since version 1.3.0 added experimental support for categorical features. From the docs:
1.8.7 Categorical Data
Other than users performing encoding, XGBoost has experimental support for categorical data using gpu_hist and gpu_predictor. No special operation needs to be done on input test data since the information about categories is encoded into the model during training.
https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf
In the DMatrix section the docs also say:
enable_categorical (boolean, optional) – New in version 1.3.0. Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it's only available for gpu_hist tree method with 1 vs rest (one hot) categorical split. Also, JSON serialization format, gpu_predictor and pandas input are required.
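A hedged sketch of what using that experimental flag might look like, assuming a GPU-enabled XGBoost >= 1.3.0, pandas input with the 'category' dtype, and made-up data:

```python
import pandas as pd
import xgboost as xgb

# hypothetical data: the 'city' column must use the pandas 'category' dtype
X = pd.DataFrame({"city": pd.Categorical(["NY", "LA", "NY", "SF"]),
                  "rooms": [3, 2, 4, 1]})
y = [1, 0, 1, 0]

# enable_categorical is experimental and (as of 1.3.0) requires gpu_hist
dtrain = xgb.DMatrix(X, label=y, enable_categorical=True)
booster = xgb.train({"tree_method": "gpu_hist", "objective": "binary:logistic"},
                    dtrain, num_boost_round=10)
```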
Other model options:
If you don't need to use XGBoost, you can use a model like LightGBM or CatBoost, which support categorical features out of the box without one-hot encoding.
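For reference, a hedged sketch of how those two libraries take categorical columns directly (tiny made-up data):

```python
import pandas as pd
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"], "rooms": [3, 2, 4, 1]})
y = [1, 0, 1, 0]

# LightGBM: cast to the pandas 'category' dtype and it is handled natively
# (min_child_samples=1 only because the toy dataset is tiny)
LGBMClassifier(min_child_samples=1).fit(X.astype({"city": "category"}), y)

# CatBoost: list the categorical columns via cat_features
CatBoostClassifier(iterations=50, verbose=False).fit(X, y, cat_features=["city"])
```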
You could use some kind of embeddings that better represent those cities (and compress the total number of features compared with direct OHE), maybe using some features to describe the continent each city belongs to, then some other features to describe the country/region, etc.
Note that since you didn't provide any specific detail about this task, I've used only geographical data in my example, but you could use other variables related to each city, like the mean temperature, the population, the area, etc., depending on the task you are trying to address here.
Another approach could be replacing the city name with its coordinates (latitude and longitude). Again, this may be helpful depending on the task for your model.
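A small, hypothetical sketch of the coordinate idea: join a city-to-coordinates lookup table and drop the raw name.

```python
import pandas as pd

# hypothetical main table and a made-up coordinates lookup
df = pd.DataFrame({"city": ["NY", "LA", "NY"], "price": [5, 3, 6]})
coords = pd.DataFrame({"city": ["NY", "LA"],
                       "lat": [40.71, 34.05],
                       "lon": [-74.01, -118.24]})

df = df.merge(coords, on="city", how="left").drop(columns=["city"])
```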
Hope this helps
Besides the models, you could also decrease the number of features (cities) by grouping them into geographical regions. Another option is grouping them by population size.
Another option is grouping them by their frequency using quantile bins. Target encoding might be another option for you.
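A rough sketch of the frequency/quantile-bin idea with pandas (made-up data):

```python
import pandas as pd

# hypothetical high-cardinality 'city' column
df = pd.DataFrame({"city": ["NY", "NY", "LA", "SF", "SF", "SF", "Austin"]})

freq = df["city"].map(df["city"].value_counts())   # how often each city occurs
df["city_freq_bin"] = pd.qcut(freq, q=4, labels=False, duplicates="drop")
```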
Feature engineering in many cases involves a lot of manual work; unfortunately, you cannot always have everything sorted out automatically.
There are already great responses here.
Another technique I would use is to cluster those cities into groups using k-means clustering on some of the city-specific features in your dataset.
That way you could use the cluster number in place of the actual city, which could reduce the number of levels quite a bit.
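A hedged sketch of that clustering idea, assuming you can build a per-city feature table (the features below are made up; with 33,000 cities you would pick a much larger number of clusters):

```python
import pandas as pd
from sklearn.cluster import KMeans

# hypothetical per-city features (population, coordinates, ...)
city_features = pd.DataFrame(
    {"city": ["NY", "LA", "SF", "Austin"],
     "population_m": [8.4, 3.9, 0.9, 1.0],
     "lat": [40.7, 34.1, 37.8, 30.3],
     "lon": [-74.0, -118.2, -122.4, -97.7]}
).set_index("city")

clusters = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(city_features)
city_to_cluster = pd.Series(clusters, index=city_features.index)

# use the cluster id instead of the raw city name in the main table
df = pd.DataFrame({"city": ["NY", "Austin", "SF"]})
df["city_cluster"] = df["city"].map(city_to_cluster)
```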
I am learning machine learning using Python and understand that I cannot run categorical data through the model and must first get dummies. Some of my categorical data has nulls (a very small fraction, in only 2 features). When I convert to dummies and then check for missing values, it always shows none. Should I impute beforehand? Or do I impute categorical data at all? For instance, if the category was male/female, I wouldn't want to replace nulls with the most frequent value. I see how this would make sense if the feature was income and I was going to impute missing values: income is income, whereas a male is not a female.
So does it make sense to impute categorical data? Am I way off? I am sorry this is more applied theory than actual Python programming, but I was not sure where to post this type of question.
I think the answer depends on the properties of your features.
Fill in missing data with expectation maximization (EM)
Say you have two features: one is gender (has missing data) and the other is wage (no missing data). If there is a relationship between the two features, you could use the information contained in wage to fill in missing values in gender.
To put it a little more formally: if you have a missing value in the gender column but a value for wage, EM gives you P(gender = Male | wage = w0, theta), i.e. the probability of the gender being male given wage = w0 and theta, a parameter obtained with maximum likelihood estimation.
In simpler terms, this can be achieved by regressing gender on wage (use logistic regression, since the y-variable is categorical) to obtain the probability described above.
Visually: (figure omitted; its totally ad-hoc values conveyed the idea that the wage distribution for males is generally above that for females)
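A minimal sketch of that regression-based fill-in with scikit-learn's LogisticRegression (made-up data; a full EM procedure would iterate, this is just the single regression step):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical data: 'gender' has missing values, 'wage' is complete
df = pd.DataFrame({"wage": [30, 45, 52, 38, 61, 44],
                   "gender": ["F", "M", "M", None, "M", None]})

known = df["gender"].notna()
clf = LogisticRegression().fit(df.loc[known, ["wage"]], df.loc[known, "gender"])

# fill the missing genders with the most probable class given the wage
df.loc[~known, "gender"] = clf.predict(df.loc[~known, ["wage"]])
```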
Fill in missing values #2
You can probably fill in missing values with the most frequent observation if you believe that the data is missing at random, even though there is no relationship between the two features. I would be cautious, though.
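If you do go this route, a most-frequent fill can be done with scikit-learn's SimpleImputer; a tiny hedged sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["male"], ["female"], [np.nan], ["male"]], dtype=object)
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(X))   # the missing entry becomes 'male'
```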
Don't impute
If there is no relationship between the two features and you believe that the missing data might not be missing at random, it is probably safer not to impute at all.
I am trying to predict the outcome of tennis matches - just a fun side project. I'm using a random forest regressor to do this. Now, one of the features is the ranking of the player before a specific match. For many matches I don't have a ranking (I only have the first 200 ranked players). The question is: is it better to put a value that is not an integer, like the string "NoRank", or to put an integer that is beyond the range of 1-200? Considering the learning algorithm, I'm inclined to put the value 201, but I would like to hear your opinions on this.
Thanks!
scikit-learn random forests do not support missing values, unfortunately. If you think that unranked players are likely to perform worse than players ranked 200 on average, then imputing the rank 201 makes sense.
Note: all scikit-learn models expect homogeneous numerical input features, not string labels or other Python objects. If you have string labels as features, you first need to find the right feature extraction strategy depending on the meaning of your string features (e.g. categorical variable identifiers, or free text to be extracted as a bag of words).
I would be careful with just adding 201 (or any other value) for the unranked players.
Random forests do not need the data to be normalized (see: Do I need to normalize (or scale) data for randomForest (R package)?), and splits are threshold-based, which means a split may group 200 together with 201, or it may not; either way, you are basically faking data that you do not have.
I would add another column, "haverank", and use a 0/1 for it: 0 for players without a rank and 1 for players with a rank. Call it "highrank" if that name sounds better. You can also add another column named "veryhighrank" and give the value 1 to all players between ranks 1-50, and so on.
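A small pandas sketch of those indicator columns (made-up data; the column names are just suggestions):

```python
import numpy as np
import pandas as pd

# hypothetical ranking column; NaN means the player is unranked
df = pd.DataFrame({"rank": [1, 15, 200, np.nan, 87, np.nan]})

df["has_rank"] = df["rank"].notna().astype(int)        # 1 = ranked, 0 = unranked
df["very_high_rank"] = (df["rank"] <= 50).astype(int)  # 1 = ranks 1-50
df["rank"] = df["rank"].fillna(201)                    # only if a numeric fill is still wanted
```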