What would be a good way to encode high cardinality variables? - python

I have received a dataset with postcodes and car brands, and one-hot encoding would make the number of variables far too large.
I have seen examples where the high-cardinality column is transformed into the number of times each value occurs in it, but the same article said this technique was not advisable for postcodes. Is this true?
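The count-based technique described above can be sketched in pandas like this (the column name and values are illustrative, not from any real dataset):

```python
import pandas as pd

# Toy data with a high-cardinality column (illustrative values)
df = pd.DataFrame({"postcode": ["AB1", "AB1", "CD2", "EF3", "AB1", "CD2"]})

# Count encoding: replace each category by its number of occurrences
counts = df["postcode"].value_counts()
df["postcode_count"] = df["postcode"].map(counts)

print(df["postcode_count"].tolist())  # [3, 3, 2, 1, 3, 2]
```

This keeps a single numeric column regardless of cardinality, which is why it is attractive here; the caveat is that two unrelated postcodes with the same frequency become indistinguishable.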

Related

Building machine learning with a dataset which has only string values

I am working with a dataset consisting of 190 columns and more than 3 million rows.
Unfortunately, all of the data is stored as string values.
Is there any way of building a model with this kind of data, other than tokenising?
Thank you and regards!
This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.
Firstly, I don't think there is any straightforward answer, as most ML models depend on the data you have. If your string data is simply Yes or No (binary), it can easily be dealt with by mapping Yes = 1 and No = 0, but that doesn't work for something like country.
One-Hot Encoding - For features like country, it would be fairly simple to one-hot encode them and train the model on the resulting numerical data. But given the number of columns you have, and the unique values in such a large amount of data, the dimensionality will increase considerably.
Assigning numeric values - We also cannot simply assign arbitrary numeric values to the strings and expect the model to work, as there is a high chance it will pick up on a numeric order that does not exist in the first place. more info
Bag of words, Word2Vec - Since you excluded tokenisation, I don't know whether you want to do this, but these are also options.
Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R, so I cannot confirm how far this is true. Nevertheless, you can find more about it here
One-Hot + Vector Assembler - I only came up with this in theory. If you manage to convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be combined into a single vector using VectorAssembler in PySpark (Spark). More info about VectorAssembler is here
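A minimal pandas sketch of the binary-mapping and one-hot options above (the Spark VectorAssembler step is only approximated here by stacking the encoded columns into a NumPy matrix; column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE", "US", "FR"],
                   "answer": ["Yes", "No", "Yes", "Yes"]})

# Binary Yes/No column: simple 0/1 mapping
df["answer"] = df["answer"].map({"No": 0, "Yes": 1})

# One-hot encode the country column
dummies = pd.get_dummies(df["country"], prefix="country")
df = pd.concat([df.drop(columns="country"), dummies], axis=1)

# All-numeric feature matrix, analogous to VectorAssembler's output
X = df.to_numpy()
print(X.shape)  # (4, 4): answer + 3 country indicator columns
```

With 190 columns and millions of rows the dummy matrix can explode in width, which is exactly the dimensionality concern raised in the answer.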

How and when to deal with outliers in your dataset (general strategy)

I stumbled about the following problem:
I'm working on a beginner's project in data science. I have my test and train data splits, and right now I'm analysing every feature, then adding it either to a dataframe for discretised continuous variables or to a dataframe for continuous variables.
Doing so, I encountered a feature with big outliers. If I were to delete them, other features I had already added to my sub-dataframes would have more entries than this one.
Should I find a strategy to overwrite the outliers with "better" values, or should I reconsider my strategy of splitting the train data into both types of variables at the beginning? I don't think getting rid of the outlier rows in the real train_data would be useful, though...
There are many ways to deal with outliers.
In my data science course we used "data imputation":
But before you start replacing or removing data, it's important to analyse what difference the outlier makes and, of course, whether the outlier is valid.
If the outlier is invalid, you can delete it and use data imputation as explained below.
If your outlier is valid, check the difference in outcome with and without it. If the difference is very small, there isn't a problem. If the difference is significant, you can use standardization and normalization.
You can replace the outlier with:
a random value (not recommended)
a value based on heuristic logic
a value based on its neighbours
the median, mean or mode.
a value based on interpolation (making a prediction with a certain ml model)
I recommend using the strategy with the best outcome.
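One of the options above, replacing outliers with the median, can be sketched like this (the 1.5 × IQR fence used to flag outliers is a common but not universal choice, and the values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 300])  # 300 is an obvious outlier

# Flag values outside the 1.5 * IQR fences
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Impute: replace flagged values with the median of the remaining values
s_clean = s.mask(outliers, s[~outliers].median())
print(s_clean.tolist())  # 300 is replaced by the median, 12
```

As the answer notes, run this only after deciding the outlier is invalid, or after comparing outcomes with and without it.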
StatQuest explains data science and machine learning concepts in a very easy and understandable way, so refer to him if you encounter more theoretical questions: https://www.youtube.com/user/joshstarmer

Encoding Categorical Variables like "State Names"

I have a Categorical column with 'State Names'. I'm unsure about which type of Categorical Encoding I'll have to perform in order to convert them to Numeric type.
There are 83 unique State Names.
A label encoder is meant for ordinal categorical variables, and one-hot encoding would increase the number of columns considerably, since there are 83 unique state names.
Is there anything else I can try?
I would use scikit-learn's OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) or CategoricalEncoder with encoding set to 'onehot'. It automatically finds the unique values for each feature and processes them into a one-hot vector. It does increase the input dimensionality for that feature, but that is often necessary in data science work. If you convert the feature to an ordinal integer (i.e. a single integer) rather than a vector of binary values, an algorithm may draw incorrect conclusions about two (possibly completely separate) categorical values that just happen to be close together in the categorical space.
There are other powerful encoding schemes besides one-hot which do not increase the number of columns. You can try the following (in increasing order of complexity):
count encoding: encode each category by the number of times it occurs in the data, useful in some cases. For example, if you want to encode the information that New York is a big city, the count of NY in the data really contains that information, as we expect NY to occur frequently.
target encoding: encode each category by the average value of the target/outcome within that category (if the target is continuous), or by the probability of the target if it is discrete. An example is encoding neighborhood, which is obviously important for predicting house price: you can replace each neighborhood name by the average house price in that neighborhood. This improves prediction considerably (as shown in my Kaggle notebook for predicting house price).
There are still other useful encoding schemes, such as CatBoost and weight of evidence. A really nice thing is that all of these schemes are already implemented in the categorical-encoder library here.
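Target encoding as described above can be sketched by hand with a groupby (the neighborhood and price values are illustrative; dedicated libraries add smoothing and out-of-fold fitting on top of this):

```python
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B"],
    "price": [100, 110, 200, 210, 190],
})

# Replace each category with the mean target value within that category
means = df.groupby("neighborhood")["price"].mean()
df["neighborhood_te"] = df["neighborhood"].map(means)

print(df["neighborhood_te"].tolist())  # [105.0, 105.0, 200.0, 200.0, 200.0]
```

Note that naively using full-data means leaks the target into the features; a common remedy is to compute the means only on out-of-fold training data.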

Prediction based on more dataframes

I'm trying to predict a score that user gives to a restaurant.
The data I have can be grouped into two dataframes
data about user (taste, personal traits, family, ...)
data about restaurant(open hours, location, cuisine, ...).
First major question is: how do I approach this?
I've already tried a basic prediction with the user dataframe (predicting one column from a few others using RandomForest), and it was pretty straightforward. These dataframes are logically different, and I can't merge them into one.
What is the best approach when doing prediction like this?
My second question is: what is the best way to handle categorical data (cuisine, for example)?
I know I can create a mapping function and convert each value to an index, or I can use Categorical from pandas (and probably a few other methods). Is there any preferred way to do this?
1) The second dataset is essentially the characteristics of the restaurant, which might influence the first dataset. For example, opening times or location are strong factors that a customer could consider. You can use them by merging them in at the restaurant level. This could help you understand how people reflect location and opening times in their score for the restaurant. Note that you could even apply clustering here and see that different customers have different sensitivities to these variables.
For example, frequent customers (who mostly eat out) may be more mindful of location, timing, etc. if eating out is part of their daily routine.
You could apply modelling techniques and run multiple simulations to get variable-importance box plots, and see whether variables like location and timing have a high variance in their importance scores when calculated on different subsets of the data; that would be indicative of different customer sensitivities.
2) You can look at label encoding or one-hot encoding, or even use the variable as is. It would help to explain how many levels there are in the data. You can look at functions like pd.get_dummies.
Hope this helps.
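The merge-at-restaurant-level idea above can be sketched like this (all table and column names are illustrative, assuming a ratings table that links the two dataframes):

```python
import pandas as pd

# Hypothetical ratings table linking users to restaurants
ratings = pd.DataFrame({"user_id": [1, 2], "restaurant_id": [10, 11], "score": [4, 5]})
users = pd.DataFrame({"user_id": [1, 2], "age": [30, 40]})
restaurants = pd.DataFrame({"restaurant_id": [10, 11], "cuisine": ["thai", "italian"]})

# Join both dataframes onto the ratings table, one key at a time
df = ratings.merge(users, on="user_id").merge(restaurants, on="restaurant_id")

# One-hot encode the categorical cuisine column, as suggested in 2)
df = pd.get_dummies(df, columns=["cuisine"])
print(sorted(df.columns))
```

The merged table gives the model one row per (user, restaurant) pair with features from both sides, which sidesteps the "can't merge them into one" problem.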

Should I drop a variable that has the same value in the whole column for building machine learning models?

For instance, column x has 50 values and all of these values are the same.
Is it a good idea to delete variables like this when building machine learning models? If so, how can I spot them in a large dataset?
I guess a formula/function might be required to do so. I am thinking of using nunique across the whole dataset.
You should delete such columns because they provide no extra information about how one data point differs from another. It is fine to leave the column in for some machine learning models (due to the nature of how their algorithms work), such as random forest, because the column will simply never be selected to split the data.
To spot these, especially for categorical or nominal variables (with a fixed number of possible values), you can count the occurrences of each unique value, and if the mode accounts for more than a certain threshold (say 95%), delete that column from your model.
I would personally go through the variables one by one if there aren't too many, so that I can fully understand each variable in the model, but the systematic approach above is possible if the feature set is too large.
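Both checks mentioned above, nunique for strictly constant columns and a mode-frequency threshold for near-constant ones, can be sketched as (column names and the 95% threshold are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "constant": [7] * 50,
    "almost_constant": ["a"] * 48 + ["b", "c"],
    "informative": range(50),
})

# Columns with a single unique value carry no information at all
constant_cols = [c for c in df.columns if df[c].nunique() == 1]

# Near-constant columns: the most frequent value exceeds a threshold
threshold = 0.95
near_constant = [c for c in df.columns
                 if df[c].value_counts(normalize=True).iloc[0] > threshold]

print(constant_cols)   # ['constant']
print(near_constant)   # ['constant', 'almost_constant']
```

The threshold variant also catches columns that are constant apart from a handful of rows, which a plain nunique check would miss.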
