Do categorical variables need to be scaled before model building?
I have scaled all my continuous numerical variables using StandardScaler,
so now all the continuous variables are between -1 and 1, whereas the categorical columns are binary.
How will this affect my model?
Can someone please explain how a scaled categorical variable will affect the splitting of nodes in the DecisionTreeClassifier?
When you one-hot encode your categorical variables, the values in the encoded variables become 0 and 1, so the encoded variables will not negatively affect your model. Encoding categorical variables and passing them to ML algorithms is good practice, as you may gain additional insight from the models.
When scaling your dataset, make sure you pay attention to 2 things:
Some ML algorithms require data to be scaled, and some do not. It is good practice to scale your data only for models that are sensitive to unscaled data, such as kNN.
There are different methods to scale your data. StandardScaler() is one of them, but it is vulnerable to outliers. Therefore, make sure you are using the scaling method that best fits your business needs. You can learn more about different scaling methods here: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
Encoded categorical variables contain only the values 0 and 1, so there is no need to scale them. However, scaling methods will still be applied to them if you choose to scale your entire dataset before using your data with scale-sensitive ML models.
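As for the DecisionTreeClassifier part of the question: tree splits are chosen by impurity reduction at thresholds, and scaling preserves the ordering of every feature, so scaling changes nothing. A minimal sketch with synthetic data (all names and values made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(size=200), rng.integers(0, 2, size=200)])  # one continuous, one binary feature
y = ((X[:, 0] > 0) & (X[:, 1] == 1)).astype(int)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Splits happen at thresholds, and scaling preserves feature ordering,
# so both trees make identical predictions.
same = (tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all()
```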
Related
I have a dataset with numerical and categorical data. The data includes outliers, which are essential for interpretation later. I've binary-encoded the categorical data and used the RobustScaler on the numerical data.
The categorical binary-encoded data does not get scaled. Is this combination possible, or is there a logical error?
There's no reason why you couldn't do that, but there's also no point.
The reason why you scale input features to be on roughly the same scale is that lots of inference methods get tripped up by features which are on vastly different scales. See Why does feature scaling improve the convergence speed for gradient descent? for more.
A binary feature which ranges from 0 to 1 and a continuous feature whose 25th to 75th percentile range spans -1 to 1 are already on approximately the same scale.
Since a binary feature is easier to interpret than a scaled binary feature, I would just leave it and not apply another scaling method.
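In scikit-learn this split treatment is easy to express with a ColumnTransformer: scale only the numerical columns and pass the binary columns through untouched. A minimal sketch with a made-up two-column frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler

# made-up frame: one numeric column containing an outlier, one binary-encoded column
df = pd.DataFrame({"income": [20_000, 35_000, 40_000, 1_000_000],
                   "is_member": [0, 1, 1, 0]})

ct = ColumnTransformer(
    [("num", RobustScaler(), ["income"])],  # robust to the outlier row
    remainder="passthrough",                # binary column passes through unscaled
)
out = ct.fit_transform(df)  # column 0: scaled income, column 1: untouched is_member
```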
I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. My output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient as a user interacts with no more than a couple of hundred of the items at most, and sometimes as little as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embeddings for categorical features. In practical terms, you define an embedding of your discrete feature space into a vector space of low dimension. It can enhance your results and save memory. The downside is that you need to train a neural network model to define the embedding beforehand.
Check this article for more information.
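To illustrate the mechanics only: an embedding is just a lookup table mapping each category to a dense vector, and a user can be represented by the mean vector of the items they interacted with. In the sketch below the matrix is random; in practice it would be learned by a neural network (e.g. an embedding layer):

```python
import numpy as np

n_items, dim = 10_000, 16
rng = np.random.default_rng(0)
# In practice this matrix is learned by a neural network;
# here it is random purely to show the lookup-and-average mechanics.
embeddings = rng.normal(size=(n_items, dim)).astype(np.float32)

def user_vector(interacted_item_ids):
    """Represent a user by the mean embedding of the items they interacted with."""
    return embeddings[interacted_item_ids].mean(axis=0)

vec = user_vector([3, 17, 4242, 9001])  # a 16-dim feature instead of 10 000 booleans
```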
XGBoost doesn't support categorical features directly (recent versions have added experimental categorical support, but historically you needed to preprocess them yourself). For example, you could do one-hot encoding. One-hot encoding usually works well if there are some frequent values of your categorical feature.
CatBoost does support categorical features directly, with both one-hot encoding and the calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with the one_hot_max_size parameter; by default, statistics are calculated. Statistics usually work better for categorical features with many values.
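If you do go the one-hot route, scikit-learn's OneHotEncoder produces a sparse matrix by default, which keeps memory manageable even with ~10 000 item categories. A minimal sketch with made-up item IDs:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# made-up item IDs standing in for a ~10 000-value categorical feature
items = np.array([["item_7"], ["item_42"], ["item_7"], ["item_9001"]])

# handle_unknown="ignore" encodes unseen categories as all-zero rows at predict time
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(items)  # sparse matrix: memory stays manageable at scale
```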
Assuming you have enough domain expertise, you could create a new categorical column from existing column.
For example, if your column has the values
A, B, C, D, E, F, G, H
and you know that A, B, C are similar, D, E, F are similar, and G, H are similar, your new column would be
Z, Z, Z, Y, Y, Y, X, X.
In your random forest model you should remove the previous column and include only this new column. Note that by transforming your features like this you lose some explainability of your model.
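Such a grouping is a one-line map in pandas; the column name and groups below are just the example values from above:

```python
import pandas as pd

df = pd.DataFrame({"grade": list("ABCDEFGH")})
# domain-knowledge grouping: A/B/C -> Z, D/E/F -> Y, G/H -> X
groups = {"A": "Z", "B": "Z", "C": "Z", "D": "Y", "E": "Y", "F": "Y", "G": "X", "H": "X"}
df["grade_group"] = df["grade"].map(groups)
df = df.drop(columns=["grade"])  # keep only the new, coarser column
```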
I have a dataset with a large number of predictive variables and I want to use them to predict a number of output variables. However, some of the things I want to predict are categorical, and others are continuous; the things I want to predict are not independent. Is it possible with scikit-learn to, for example, mix a classifier and a regressor so that I can predict and disentangle these variables? (I'm currently looking at gradient boosting classifiers/regressors, but there may be better options.)
You can certainly use One Hot Encoding or Dummy Variable Encoding, to convert labels to numerics. See the link below for all details.
https://codefires.com/how-convert-categorical-data-numerical-data-python/
As an aside, Random Forest is a popular machine learning model that is commonly used for classification tasks, as can be seen in many academic papers, Kaggle competitions, and blog posts. In addition to classification, Random Forests can also be used for regression tasks. A Random Forest's nonlinear nature can give it a leg up over linear algorithms, making it a great option.
However, it is important to know your data and keep in mind that a Random Forest can't extrapolate. It can only make a prediction that is an average of previously observed labels. In this sense it is very similar to KNN. In other words, in a regression problem, the range of predictions a Random Forest can make is bound by the highest and lowest labels in the training data. This behavior becomes problematic in situations where the training and prediction inputs differ in their range and/or distributions. This is called covariate shift, and it is difficult for most models to handle, but especially for Random Forest, because it can't extrapolate.
https://towardsdatascience.com/a-limitation-of-random-forest-regression-db8ed7419e9f
https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn
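The no-extrapolation point is easy to demonstrate: a RandomForestRegressor's prediction is an average of training labels, so even for an input far outside the training range the prediction stays bounded by the training targets. A quick synthetic sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 2.0 * X_train.ravel()  # simple linear relationship, labels in [0, 20]

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
pred = rf.predict([[100.0]])[0]  # input far outside the training range
# pred stays near 20: the forest averages training labels and cannot extrapolate
```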
In closing, Scikit-learn uses numpy matrices as inputs to its models. As such all features become de facto numerical (if you have categorical feature you’ll need to convert them to numerical).
I don't think there's a builtin way. There are ClassifierChain and RegressorChain that allow you to use earlier predictions as features in later predictions, but as the names indicate they assume either classification or regression. Two options come to mind:
Manually patch those together for what you want to do. For example, use a ClassifierChain to predict each of your categorical targets using just the independent features, then add those predictions to the dataset before training a RegressorChain with the numeric targets.
Use those classes as a base for defining a custom estimator. In that case you'll probably look mostly at their common parent class _BaseChain. Unfortunately that also uses a single estimator attribute, whereas you'd need (at least) two, one classifier and one regressor.
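The first option above can be sketched in a few lines with synthetic data: fit a ClassifierChain on the (here binary) categorical targets, append its predictions as extra features, then fit a RegressorChain on the numeric targets. Shapes and targets are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.multioutput import ClassifierChain, RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y_cat = (X[:, :2] + rng.normal(scale=0.1, size=(200, 2)) > 0).astype(int)  # two binary targets
Y_num = X[:, 2:4] + rng.normal(scale=0.1, size=(200, 2))                   # two numeric targets

# Step 1: predict the categorical targets from the independent features
clf_chain = ClassifierChain(LogisticRegression()).fit(X, Y_cat)

# Step 2: append those predictions as features for the numeric targets
X_aug = np.hstack([X, clf_chain.predict(X)])
reg_chain = RegressorChain(LinearRegression()).fit(X_aug, Y_num)
preds = reg_chain.predict(X_aug)
```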
Below is the data set from the UCI data repository. I want to build a regression model taking platelets count as the dependent variable(y) and the rest as features/inputs.
However, there are a few categorical variables, such as anemia, sex, smoking, and DEATH_EVENT, already in numeric form in the data set.
My questions are:
Should I perform 'one-hot encoding' on these variables before building a regression model?
Also, I observe the values are in various ranges, so should I even scale the data set before applying the regression model?
1. Should I perform 'one-hot encoding' on these variables before building a regression model?
Yup, you should one-hot encode the categorical variables. You can do it like below:
columns_to_category = ['sex', 'smoking','DEATH_EVENT']
df[columns_to_category] = df[columns_to_category].astype('category') # change dtypes to category
df = pd.get_dummies(df, columns=columns_to_category) # One hot encoding the categories
2. If so, is one-hot encoding alone sufficient, or should I also perform label encoding?
One hot encoding should be sufficient I guess.
3. Also, I observe the values are in various ranges, so should I even scale the data set before applying the regression model?
Yes, you can use either StandardScaler() or MinMaxScaler() to get better results, and then inverse-scale the predictions. Also, make sure you fit the scaler on the training set only and then apply it to the test set, rather than fitting on the combined data: in real life the test data is not available at training time, and fitting on everything leaks information into the model.
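A minimal sketch of that fit-on-train-only pattern, using synthetic data and StandardScaler as one possible choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=50, scale=10, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)  # learn mean/std from the training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # reuse the training statistics: no leakage
```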
If those are truly binary categories, you don't have to one hot encode. They are already encoded.
You don't have to use one-hot encoding, as those columns already have numerical values. However, if those values are actually strings rather than ints or floats, then you should one-hot encode them. As for scaling, the variation is considerable, so you should scale the data to avoid your regression model being biased towards high values.
I wish to fit a logistic regression model with a set of parameters. The parameters that I have include three distinct types of data:
Binary data [0,1]
Categorical data which has been encoded to integers [0,1,2,3,...]
Continuous data
I have two questions regarding pre-processing the parameter data before fitting a regression model:
For the categorical data, I've seen two ways to handle this. The first method is to use a one hot encoder, thus giving a new parameter for each category. The second method, is to just encode the categories with integers within a single parameter variable [0,1,2,3,4,...]. I understand that using a one hot encoder creates more parameters and therefore increases the risk of over-fitting the model; however, other than that, are there any reasons to prefer one method over the other?
I would like to normalize the parameter data to account for the large differences between the continuous and binary data. Is it generally acceptable to normalize binary and categorical data? Should I normalize the categorical and continuous parameters but not the binary ones, or can I just normalize all the parameter data types?
I realize I could fit this data with a random forest model and not have to worry much about pre-processing, but I'm curious how this applies with a regression type model.
Thank you in advance for your time and consideration.