I want to perform multiple linear regression in python with lasso. I am not sure whether the input observation matrix X can contain categorical variables. I read the instructions from here: lasso in python
But it is brief and does not indicate which variable types are allowed. For example, my code includes:
from sklearn.linear_model import Lasso

model = Lasso(fit_intercept=False, alpha=0.01)
model.fit(X, y)
In the code above, X is an observation matrix of size n-by-p. Can one of the p variables be of categorical type?
You need to represent the categorical variables using 0s and 1s. If a categorical variable is binary, meaning each observation belongs to one of two categories A or B, you simply replace category A with 0 and category B with 1. If some variables have more than two categories, you will need to use dummy variables.
I usually have my data in a Pandas dataframe, in which case I use houses = pd.get_dummies(houses), which creates the dummy variables.
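For instance, a minimal sketch of that workflow; the dataframe name houses and the columns "area", "neighborhood", and "price" are hypothetical, just for illustration:

import pandas as pd
from sklearn.linear_model import Lasso

# toy data: one numeric column and one categorical column
houses = pd.DataFrame({
    "area": [50, 80, 120, 65],
    "neighborhood": ["A", "B", "A", "C"],
    "price": [150, 240, 330, 180],
})

# pd.get_dummies replaces the categorical column with 0/1 dummy columns
houses = pd.get_dummies(houses, columns=["neighborhood"])

X = houses.drop(columns="price")
y = houses["price"]

model = Lasso(alpha=0.01)
model.fit(X, y)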
A previous poster has a good answer for this: you need to encode your categorical variables. The standard way is one-hot encoding (or dummy encoding), but there are many other methods for doing this.
Here is a good library that offers many different ways to encode your categorical variables. They are also implemented to work with scikit-learn.
https://contrib.scikit-learn.org/categorical-encoding/
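As a rough sketch of how that library is typically used (assuming the package is installed and imported as category_encoders; the column names are made up):

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                   "target": [1, 0, 1, 0]})

# one-hot encoding via category_encoders; the API follows scikit-learn's fit/transform
encoder = ce.OneHotEncoder(cols=["color"])
X_encoded = encoder.fit_transform(df[["color"]])
print(X_encoded.head())

The same fit/transform pattern applies to its other encoders (ordinal, target, etc.), so they can be dropped into a scikit-learn pipeline.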
Related
AFAIK, unlike SMOTE, RandomUnderSampler selects a subset of the data. But I am not quite confident about using it for categorical data.
So, is it really applicable for categorical data?
Under/over-sampling has nothing to do with the features. It relies on the targets and under/over-samples the majority/minority class, no matter whether the features are composed of continuous variables, categorical ones, or elephants :)
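A minimal sketch with imbalanced-learn; the feature columns here are invented, and if I recall correctly the random samplers accept non-numeric columns because they only drop rows, so the categorical column passes through untouched:

import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

X = pd.DataFrame({"age": [25, 32, 47, 51, 29, 38],
                  "city": ["NY", "LA", "NY", "SF", "LA", "SF"]})  # categorical feature
y = pd.Series([0, 0, 0, 0, 1, 1])  # imbalanced target

rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)  # simply drops rows from the majority class
print(X_res)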
I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. The output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient, as a user interacts with no more than a couple of hundred of the items, and sometimes as few as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embeddings, but for categorical features. In practical terms, you define an embedding of your discrete feature space into a vector space of low dimension. It can improve your results and save memory. The downside is that you need to train a neural network model beforehand to learn the embedding.
Check this article for more information.
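A rough sketch of the idea with Keras; the vocabulary size, embedding dimension, and layer sizes are arbitrary assumptions:

import tensorflow as tf

n_items = 10_000     # number of distinct item ids (assumed)
embedding_dim = 32   # size of the learned vector per item (assumed)

item_id = tf.keras.Input(shape=(1,), dtype="int32")
emb = tf.keras.layers.Embedding(input_dim=n_items, output_dim=embedding_dim)(item_id)
emb = tf.keras.layers.Flatten()(emb)
out = tf.keras.layers.Dense(n_items, activation="softmax")(emb)

model = tf.keras.Model(item_id, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# after training on (item_id, next_item) pairs, the Embedding layer's weights give a
# dense 32-dimensional vector per item that can replace 10 000 one-hot columns in a
# tree-based model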
XGBoost doesn't support categorical features directly; you need to do the preprocessing to use it with categorical features. For example, you could do one-hot encoding. One-hot encoding usually works well if there are some frequent values of your categorical feature.
CatBoost does have categorical feature support: both one-hot encoding and the calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with the one_hot_max_size parameter; by default, statistics are calculated. Statistics usually work better for categorical features with many values.
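A minimal sketch of passing a categorical column to CatBoost directly; the column names and parameter values are illustrative only:

import pandas as pd
from catboost import CatBoostClassifier

# toy data: one categorical feature and one numeric feature
X_train = pd.DataFrame({"item_history": ["a", "b", "a", "c", "b", "c"],
                        "n_interactions": [5, 12, 7, 3, 9, 4]})
y_train = [0, 1, 0, 1, 1, 0]

model = CatBoostClassifier(
    iterations=50,
    one_hot_max_size=10,   # categories with <= 10 distinct values are one-hot encoded
    verbose=False,
)
# cat_features tells CatBoost which columns to treat as categorical
model.fit(X_train, y_train, cat_features=["item_history"])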
Assuming you have enough domain expertise, you could create a new categorical column from existing column.
For example, if your column has the values below:
A, B, C, D, E, F, G, H
and you are aware that A, B, C are similar, D, E, F are similar, and G, H are similar,
your new column would be:
Z, Z, Z, Y, Y, Y, X, X.
In your random forest model you should remove the previous column and include only this new column. By transforming your features like this you would lose some explainability of your model.
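A quick sketch of that grouping in pandas; the mapping itself is the hypothetical one from the example above:

import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "C", "D", "E", "F", "G", "H"]})

# collapse the original categories into coarser groups based on domain knowledge
grouping = {"A": "Z", "B": "Z", "C": "Z",
            "D": "Y", "E": "Y", "F": "Y",
            "G": "X", "H": "X"}

df["category_group"] = df["category"].map(grouping)
df = df.drop(columns="category")  # keep only the new, coarser column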
I am a newbie in the data science domain.
I have a data set which has both numerical and string data. The interesting fact is that both types of data are meaningful for the outcome. How do I choose the relevant features from the data set?
Should I use LabelEncoder to convert the data from string to numerical and then continue with the correlation analysis? Am I taking the right path? Is there a better way to solve this?
You can encode categorical variables with label encoding if there is a meaningful ordering of the available values, making sure the ordering is retained in the encoding. See here for an example.
If there is no ordering (or working out a meaningful one is too much work), you can use one-hot encoding. This, however, will increase the feature set in proportion to the number of distinct values the feature takes in the dataset.
If one-hot results in a very large feature set and the categorical string data are natural language words, you may want to use a pretrained embedding.
Either way, you can then concatenate the encoded categorical column(s) to the continuous feature set and proceed with learning and feature selection.
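A minimal sketch of both options with scikit-learn; the column names and the assumed ordering of "size" are made up for illustration:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({"size": ["small", "large", "medium", "small"],   # ordered categories
                   "color": ["red", "blue", "green", "red"],        # unordered categories
                   "weight": [1.2, 3.4, 2.1, 1.0]})                 # continuous feature

# label/ordinal encoding with an explicit, meaningful order
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = ord_enc.fit_transform(df[["size"]]).ravel()

# one-hot encoding for the unordered categorical
oh_enc = OneHotEncoder(sparse_output=False)  # use sparse=False on older scikit-learn
color_dummies = pd.DataFrame(oh_enc.fit_transform(df[["color"]]),
                             columns=oh_enc.get_feature_names_out(["color"]))

# concatenate encoded categoricals with the continuous feature set
X = pd.concat([df[["weight", "size_encoded"]], color_dummies], axis=1)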
Kind of a cop-out, but you could simply use a random forest and happily mix numerical and categorical data. Encoding with LabelEncoder or OneHotEncoder would allow you to use a wider variety of algorithms.
Below is the data set from the UCI data repository. I want to build a regression model taking the platelets count as the dependent variable (y) and the rest as features/inputs.
However, there are a few categorical variables, such as anemia, sex, smoking, and DEATH_EVENT, in the data set in numeric form.
My questions are:
Should I perform 'one-hot encoding' on these variables before building a regression model?
Also, I observe the values are in various ranges, so should I even scale the data set before applying the regression model?
1. Should I perform 'one-hot encoding' on these variables before building a regression model?
Yup, you should one-hot encode the categorical variables. You can do it like below:
import pandas as pd

columns_to_category = ['sex', 'smoking', 'DEATH_EVENT']
df[columns_to_category] = df[columns_to_category].astype('category')  # change dtypes to category
df = pd.get_dummies(df, columns=columns_to_category)  # one-hot encode the categories
2. If so, is one-hot encoding sufficient, or should I also perform label encoding?
One-hot encoding should be sufficient, I guess.
3. Also, I observe the values are in various ranges, so should I even scale the data set before applying the regression model?
Yes, you can use either StandardScaler() or MinMaxScaler() to get better results, and then inverse-transform the scaled predictions. Also, make sure you fit the scaler on the training data only and then apply it to the test data, rather than fitting on the combined data; in real life the test data is not available at training time, so you need to scale accordingly to avoid such errors.
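A short sketch of that scaling workflow; the toy data and the train/test split are just for completeness:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5) * 1000   # toy features with large value ranges
y = np.random.rand(100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test set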
If those are truly binary categories, you don't have to one hot encode. They are already encoded.
You don't have to use one-hot encoding, as those columns already have numerical values. However, if those values are actually stored as strings instead of int or float, then you should one-hot encode them. As for scaling the data: the variation between feature ranges is considerable, so you should scale it so that the regression model is not biased towards features with large values.
I am using a data-set to make some predictions using the multi-variable regression techniques. I have to predict the salary of the employees based on some independent variables like gender, percentage, date of birth, marks in different subjects, degree, specialization etc.
Numeric parameters (e.g. marks and percentage in different subjects) are fine to use with the regression model. But how do we normalize the non-numeric parameters (gender, date of birth, degree, specialization) here?
P.S. : I am using the scikit-learn : machine learning in python package.
You want to encode your categorical parameters.
For binary categorical parameters such as gender, this is relatively easy: introduce a single binary parameter: 1=female, 0=male.
If there are more than two categories, you could try one-hot encoding.
Read more in the scikit-learn documentation:
http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
Note that date is not a categorical parameter! Convert it into a unix timestamp (seconds since epoch) and you have a nice parameter on which you can regress.
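A rough sketch along those lines; the column names and values are invented, and the timestamp conversion shown is just one common pandas idiom for seconds since the epoch:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"gender": ["female", "male", "female"],
                   "degree": ["BSc", "MSc", "PhD"],
                   "dob": ["1990-05-01", "1985-11-23", "1992-02-14"]})

# binary categorical: map directly to 0/1
df["gender"] = (df["gender"] == "female").astype(int)

# multi-valued categorical: one-hot encode
enc = OneHotEncoder(sparse_output=False)  # use sparse=False on older scikit-learn
degree_ohe = enc.fit_transform(df[["degree"]])

# date of birth: convert to a unix timestamp so it becomes a regular numeric feature
df["dob"] = (pd.to_datetime(df["dob"]) - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")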
"Normaliz[ing] non-numeric parameters" is actually a huge area of regression. The most common treatment is to turn each categorical into a set of binary variables called dummy variables.
Each categorical with n values should be converted into n-1 dummy variables. So for example, for gender, you might have one variable, "female", that would be either 0 or 1 at each observation. Why n-1 and not n? Because you want to avoid the dummy variable trap, where basically the intercept column of all 1's can be reconstructed from a linear combination of your dummy columns. In relatively non-technical terms, that's bad because it messes up the linear algebra needed to do the regression.
I am not so familiar with the scikit-learn library, but I urge you to make sure that whatever methods you use, each categorical becomes n-1 new columns, and not n.
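For instance, pandas can produce the n-1 dummies directly; a small sketch, reusing the gender example from above:

import pandas as pd

df = pd.DataFrame({"gender": ["female", "male", "female", "male"]})

# drop_first=True emits n-1 dummy columns, avoiding the dummy variable trap
dummies = pd.get_dummies(df["gender"], prefix="gender", drop_first=True)
print(dummies)
# only one column remains (gender_male); "female" is the implicit baseline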
I hope this can help you. The whole description of how to use that function is available at this link.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html