Can I use RandomUnderSampler for categorical data as well? - python

AFAIK, unlike SMOTE, RandomUnderSampler selects a subset of the data rather than synthesizing new samples. But I am not quite confident about using it for categorical data.
So, is it really applicable to categorical data?

Under/over-sampling has nothing to do with the features. It relies only on the targets and under/over-samples the majority/minority class, no matter whether the data is composed of continuous variables, categorical ones, or elephants :)
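For instance, a minimal sketch with imbalanced-learn (assuming a reasonably recent version; the toy data are made up):

import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

X = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "red"],  # categorical feature
    "size": [1.0, 2.5, 0.3, 4.1, 2.2, 3.3],                    # continuous feature
})
y = pd.Series([0, 0, 0, 0, 1, 1])  # imbalanced target

rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)  # whole rows are dropped; feature values are never touched
print(y_res.value_counts())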

Related

Encoding a large number of categorical features [duplicate]

I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. The output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory-inefficient, since a user interacts with no more than a couple of hundred items at most, and sometimes as few as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
This is similar to word embeddings for categorical features. In practical terms, you define an embedding of your discrete feature space into a low-dimensional vector space. It can improve your results and save on memory. The downside is that you do need to train a neural network model beforehand to define the embedding.
Check this article for more information.
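As a rough illustration, here is a minimal sketch of such an embedding in Keras (the item count, embedding width, and network shape are all hypothetical choices, not part of the answer above):

import tensorflow as tf

n_items = 10_000  # size of the discrete feature space
embed_dim = 16    # low-dimensional target space (hypothetical choice)

item_id = tf.keras.Input(shape=(1,), dtype="int32")
x = tf.keras.layers.Embedding(input_dim=n_items, output_dim=embed_dim)(item_id)
x = tf.keras.layers.Flatten()(x)
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(item_id, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# after model.fit(...), row i of the embedding matrix is the learned vector for item i:
# embeddings = model.layers[1].get_weights()[0]  # shape (10000, 16)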
XGBoost doesn't support categorical features directly; you need to preprocess them to use it with categorical features. For example, you could do one-hot encoding. One-hot encoding usually works well if there are a few frequent values of your categorical feature.
CatBoost does have categorical feature support, in two forms: one-hot encoding and the calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with the one_hot_max_size parameter; by default, statistics are calculated. Statistics usually work better for categorical features with many values.
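For example, a minimal sketch of the CatBoost route (the toy data and the one_hot_max_size value are illustrative, not prescribed by the answer above):

import pandas as pd
from catboost import CatBoostClassifier

X = pd.DataFrame({"item": ["a", "b", "c", "a", "b", "c"],
                  "price": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5]})
y = [0, 1, 1, 0, 1, 0]

model = CatBoostClassifier(
    one_hot_max_size=10,  # one-hot encode categorical features with <= 10 distinct values
    iterations=50,
    verbose=False,
)
model.fit(X, y, cat_features=["item"])  # features above the threshold get target statistics instead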
Assuming you have enough domain expertise, you could create a new categorical column from the existing column.
For example, if your column has the values
A, B, C, D, E, F, G, H
and you know that A, B, C are similar, D, E, F are similar, and G, H are similar, your new column would be
Z, Z, Z, Y, Y, Y, X, X.
In your random forest model you should remove the previous column and include only this new column. Be aware that by transforming your features like this you lose some explainability of your model.
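In pandas, such a grouping could look like this (a minimal sketch; the column name and mapping mirror the toy example above):

import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "C", "D", "E", "F", "G", "H"]})
groups = {"A": "Z", "B": "Z", "C": "Z",
          "D": "Y", "E": "Y", "F": "Y",
          "G": "X", "H": "X"}
df["grade_group"] = df["grade"].map(groups)  # coarser category derived from domain knowledge
df = df.drop(columns=["grade"])              # keep only the new column, as suggested above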

Feature Selection from a Mixed Dataset

I am a newbie in the data science domain.
I have a data set which has both numerical and string data. The interesting fact is that both types of data are meaningful for the outcome. How do I choose the relevant features from the data set?
Should I use the LabelEncoder to convert the data from string to numerical and continue with the correlation analysis? Am I taking the right path? Is there a better way to solve this?
You can encode categorical variables with label encoding if there is a meaningful ordering of the available values, making sure the ordering is retained in the encoding. See here for an example.
If there is no ordering (or working out a meaningful one is too much work), you can use one-hot encoding. This, however, will grow the feature set in proportion to the number of distinct values the feature takes in the dataset.
If one-hot results in a very large feature set and the categorical string data are natural language words, you may want to use a pretrained embedding.
Either way, you can then concatenate the encoded categorical column(s) to the continuous feature set and proceed with learning and feature selection.
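As a concrete illustration of that last step, a minimal sketch with scikit-learn (column names are made up; sparse_output requires scikit-learn >= 1.2, older versions use sparse=False):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"],
                   "income": [55.0, 48.0, 61.0, 70.0]})

enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
city_1h = pd.DataFrame(enc.fit_transform(df[["city"]]),
                       columns=enc.get_feature_names_out(["city"]))
X = pd.concat([df[["income"]], city_1h], axis=1)  # continuous + encoded categorical, ready for selection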
Kind of a cop-out, but you could simply use a random forest and happily mix numerical and categorical data. Encoding with LabelEncoder or OneHotEncoder would allow you to use a wider variety of algorithms.

Can the input for "Lasso" in python contain categorical variables?

I want to perform multiple linear regression in python with the lasso. I am not sure whether the input observation matrix X can contain categorical variables. I read the instructions here: lasso in python.
But the documentation is brief and does not indicate which input types are allowed. For example, my code includes:
from sklearn.linear_model import Lasso  # import needed for this snippet to run

model = Lasso(fit_intercept=False, alpha=0.01)
model.fit(X, y)
In the code above, X is an observation matrix of size n-by-p; can one of the p variables be of categorical type?
You need to represent the categorical variables using 1s and 0s. If a categorical variable is binary, meaning each value belongs to one of two categories A and B, you replace them with 0 and 1, respectively. If some variables have more than two categories, you will need to use dummy variables.
I usually have my data in a pandas DataFrame, in which case I use houses = pd.get_dummies(houses), which creates the dummy variables.
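A minimal sketch of what that call does (the column names and values are made up for illustration):

import pandas as pd

houses = pd.DataFrame({"style": ["ranch", "colonial", "ranch", "tudor"],
                       "sqft": [1200, 2400, 1500, 2100]})
houses = pd.get_dummies(houses, columns=["style"])  # numeric columns pass through untouched
# 'style' is replaced by 0/1 indicator columns that Lasso can consume directly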
A previous poster has a good answer for this: you need to encode your categorical variables. The standard way is one-hot encoding (or dummy encoding), but there are many methods for doing this.
Here is a good library that has many different ways to encode your categorical variables. These are also implemented to work with scikit-learn.
https://contrib.scikit-learn.org/categorical-encoding/

Dealing with missing values

I have two data sets, training and test set.
If I have NA values in the training set but not in the test set, I usually drop those rows (if they are few) from the training set and that's all.
But now I have a lot of NA values in both sets, so I have dropped the features that consist mostly of NA values, and I am wondering what to do next.
Should I just drop the same features in the test set and impute the rest missing values?
Is there any other technique I could use to preprocess the data?
Can machine learning algorithms like logistic regression, decision trees or neural networks handle missing values?
The data sets come from a Kaggle competition, so I can't do the preprocessing before splitting the data.
Thanks in advance
This question is not so easy to answer, because it depends on the type of NA values.
Are the NA values missing for some random reason? Or is there a reason they are missing (no matching multiple-choice answer in a survey, or perhaps something people would rather not answer)?
For the first case, it would be fine to use a simple imputation strategy so that you can fit your model to the data. By that I mean something like mean imputation, sampling from an estimated probability distribution, or even sampling values at random. Note that if you simply take the mean of the existing values, you change the statistics of the dataset, i.e. you reduce the standard deviation. You should keep that in mind when choosing your model.
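For illustration, mean imputation takes a couple of lines in scikit-learn (the toy array is made up):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])
imp = SimpleImputer(strategy="mean")  # per-column mean; note the caveat above about shrinking the std
X_filled = imp.fit_transform(X)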
For the second case, you will have to apply your domain knowledge to find good fill values.
Regarding your last question: if you want to fill the values with a machine learning model, you can use the other features of the dataset, implicitly assuming a dependency between the missing feature and the other features. Depending on the model you later use for prediction, you may not benefit from this intermediate estimation.
I hope this helps, but the correct answer really depends on the data.
In general, machine learning algorithms do not cope well with missing values (for mostly good reasons, as it is not known why they are missing or what being missing means, which could even differ between observations).
Good practice would be to do the preprocessing before the split between training and test sets (are your training and test data truly random subsets of the data, as they should be?) and ensure that both sets are treated identically.
There is a plethora of ways to deal with your missing data, and which are better depends strongly on the data as well as on your goals. Feel free to get in touch if you need more specific advice.

Mixed parameter types for machine learning

I wish to fit a logistic regression model with a set of features. The features that I have include three distinct types of data:
Binary data [0,1]
Categorical data which has been encoded to integers [0,1,2,3,...]
Continuous data
I have two questions regarding pre-processing the feature data before fitting a regression model:
For the categorical data, I've seen two ways to handle this. The first method is to use a one-hot encoder, thus giving a new feature for each category. The second method is to just encode the categories as integers within a single feature variable [0,1,2,3,4,...]. I understand that using a one-hot encoder creates more features and therefore increases the risk of over-fitting the model; however, other than that, are there any reasons to prefer one method over the other?
I would like to normalize the feature data to account for the large differences between the continuous and binary data. Is it generally acceptable to normalize binary and categorical data? Should I normalize the categorical and continuous features but not the binary ones, or can I just normalize all of the feature types?
I realize I could fit this data with a random forest model and not have to worry much about pre-processing, but I'm curious how this applies to a regression-type model (a sketch of one common arrangement follows below).
Thank you in advance for your time and consideration.
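For what it's worth, one common way to wire this up in scikit-learn is a ColumnTransformer that scales only the continuous columns, one-hot encodes the categorical ones, and passes the binary ones through. A minimal sketch, with entirely hypothetical column names:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

binary_cols = ["is_member"]      # hypothetical column names
categorical_cols = ["region"]
continuous_cols = ["age", "income"]

pre = ColumnTransformer([
    ("binary", "passthrough", binary_cols),                         # 0/1 needs no scaling
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("scale", StandardScaler(), continuous_cols),                   # normalize only the continuous part
])
clf = Pipeline([("pre", pre), ("logreg", LogisticRegression(max_iter=1000))])
# clf.fit(X_train, y_train) expects a DataFrame containing the columns named above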
