I have a set of alphanumeric categorical features (c_1,c_2, ..., c_n) and one numeric target variable (prediction) as a pandas dataframe. Can you please suggest to me any feature selection algorithm that I can use for this data set?
I'm assuming you are solving a supervised learning problem like Regression or Classification.
First of all I suggest to transform the categorical features into numeric ones using one-hot encoding. Pandas provides an useful function that already does it:
dataset = pd.get_dummies(dataset, columns=['feature-1', 'feature-2', ...])
If you have a limited number of features and a model that is not too computationally expensive you can test the combination of all the possible features, it is the best way however it is seldom a viable option.
A possible alternative is to sort all the features using the correlation with the target, then sequentially add them to the model, measure the goodness of my model and select the set of features that provides the best performance.
If you have high dimensional data, you can consider to reduce the dimensionality using PCA or another dimensionality reduction technique, it projects the data into a lower dimensional space reducing the number of features, obviously you will loose some information due to the PCA approximation.
These are only some examples of methods to perform feature selection, there are many others.
Final tips:
Remember to split the data into Training, Validation and Test set.
Often data normalization is recommended to obtain better results.
Some models have embedded mechanism to perform feature selection (Lasso, Decision Trees, ...).
Related
I have a case where I want to predict columns H1 and H2 which are continuous data with all categorical features in the hope of getting a combination of features that give optimal results for H1 and H2, but the distribution of the categories is uneven, there are some categories which only amount to 1,
Heres my data :
and my information of categories frequency in each column:
what I want to ask:
Does the imbalance of the features of the categories greatly affect the predictions? what is the right solution to deal with the problem?
How do you know the optimal combination? do you have to run a data test simulation predicting every combination of features with the created model?
What analytical technique is appropriate to determine the relationship between features on H1 and H2? So far I'm converting category data using one hot encoding and then calculating the correlation map
What ML model can be applied to my case? until now I have tried the RF, KNN, and SVR models but the RMSE score still high
What keywords that have similar cases and can help me to search for articles on google, this is my first time working on an ML/DS case for a paper.
thank you very much
A prediction based on a single observation won't be too reliable, of course. Binning rare categories into a sort of 'other' category is one common approach.
Feature selection is a vast topic (g: filter methods, embedded methods, wrapper methods). Personally I prefer studying mutual information and variance inflation factor first.
We cannot rely on Pearson's correlation when talking about categorical or binary features. The basic approach would be grouping your dataset by categories and comparing the target distributions for each one, running statistical tests perhaps to check whether the difference is significant. Also g: ANOVA, Kendall rank.
That said, preprocessing your data to get rid of useless or redundant features often yields much more improvement than using more complex models or hyperparameter tuning. Regardless, trying out gradient boosting models never hurts (catboost even provides a robust automatic handling of categorical features). ExtraTreesRegressor is less prone to overfitting than classic RF. Linear models should not be ignored either, especially ones like Lasso with embedded feature selection capability.
I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. My output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient as a user interacts with no more than a couple of hundred of the items at most, and sometimes as little as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embedings for categorical features. In practical terms you define an embedding of your discrete space of features into a vector space of low dimension. It can enhance your results and save on memory. The downside is that you do need to train a neural network model to define the embedding before hand.
Check this article for more information.
XGBoost doesn't support categorical features directly, you need to do the preprocessing to use it with catfeatures. For example, you could do one-hot encoding. One-hot encoding usually works well if there are some frequent values of your cat feature.
CatBoost does have categorical features support - both, one-hot encoding and calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with one_hot_max_size parameter, by default statistics are calculated. Statistics usually work better for categorical features with many values.
Assuming you have enough domain expertise, you could create a new categorical column from existing column.
ex:-
if you column has below values
A,B,C,D,E,F,G,H
if you are aware that A,B,C are similar D,E,F are similar and G,H are similar
your new column would be
Z,Z,Z,Y,Y,Y,X,X.
In your random forest model you should removing previous column and only include this new column. By transforming your features like this you would loose explainability of your mode.
I have a dataset with a large number of predictive variables and I want to use them to predict a number of output variables. However, some of the things I want to predict are categorical, and others are continuous; the things I want to predict are not independent. Is it possible with scikit-learn to, for example, mix a classifier and a regressor so that I can predict and disentangle these variables? (I'm currently looking at gradient boosting classifiers/regressors, but there may be better options.)
You can certainly use One Hot Encoding or Dummy Variable Encoding, to convert labels to numerics. See the link below for all details.
https://codefires.com/how-convert-categorical-data-numerical-data-python/
As an aside, Random Forest is a popular machine learning model that is commonly used for classification tasks as can be seen in many academic papers, Kaggle competitions, and blog posts. In addition to classification, Random Forests can also be used for regression tasks. A Random Forest’s nonlinear nature can give it a leg up over linear algorithms, making it a great option. However, it is important to know your data and keep in mind that a Random Forest can’t extrapolate. It can only make a prediction that is an average of previously observed labels. In this sense it is very similar to KNN. In other words, in a regression problem, the range of predictions a Random Forest can make is bound by the highest and lowest labels in the training data. This behavior becomes problematic in situations where the training and prediction inputs differ in their range and/or distributions. This is called covariate shift and it is difficult for most models to handle but especially for Random Forest, because it can’t extrapolate.
https://towardsdatascience.com/a-limitation-of-random-forest-regression-db8ed7419e9f
https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn
In closing, Scikit-learn uses numpy matrices as inputs to its models. As such all features become de facto numerical (if you have categorical feature you’ll need to convert them to numerical).
I don't think there's a builtin way. There are ClassifierChain and RegressorChain that allow you to use earlier predictions as features in later predictions, but as the names indicate they assume either classification or regression. Two options come to mind:
Manually patch those together for what you want to do. For example, use a ClassifierChain to predict each of your categorical targets using just the independent features, then add those predictions to the dataset before training a RegressorChain with the numeric targets.
Use those classes as a base for defining a custom estimator. In that case you'll probably look mostly at their common parent class _BaseChain. Unfortunately that also uses a single estimator attribute, whereas you'd need (at least) two, one classifier and one regressor.
I am working on a text classification problem where multiple text features and need to build a model to predict salary range. Please refer the Sample dataset
Most of the resources/tutorials deal with feature extraction on only one column and then predicting target. I am aware of the processes such as text pre-processing, feature extraction (CountVectorizer or TF-IDF) and then the applying algorithms.
In this problem, I have multiple input text features. How to handle text classification problems when multiple features are involved? These are the methods I have already tried but I am not sure if these are the right methods. Kindly provide your inputs/suggestion.
1) Applied data cleaning on each feature separately followed by TF-IDF and then logistic regression. Here I tried to see if I can use only one feature for classification.
2) Applied Data cleaning on all the columns separately and then applied TF-IDF for each feature and then merged the all feature vectors to create only one feature vector. Finally logistic regression.
3) Applied Data cleaning on all the columns separately and merged all the cleaned columns to create one feature 'merged_text'. Then applied TF-IDF on this merged_text and followed by logistic regression.
All these 3 methods gave me around 35-40% accuracy on cross-validation & test set. I am expecting at-least 60% accuracy on the test set which is not provided.
Also, I didn't understand how use to 'company_name' & 'experience' with text data. there are about 2000+ unique values in company_name. Please provide input/pointer on how to handle numeric data in text classification problem.
Try these things:
Apply text preprocessing on 'job description', 'job designation' and 'key skills. Remove all stop words, separate each words removing punctuations, lowercase all words then apply TF-IDF or Count Vectorizer, don't forget to scale these features before training model.
Convert Experience to Minimum experience and Maximum experience 2 features and treat is as a discrete numeric feature.
Company and location can be treated as a categorical feature and create dummy variable/one hot encoding before training the model.
Try combining job type and key skills and then do vectorization, see how if it works better.
Use Random Forest Regressor, tune hyperparameters: n_estimators, max_depth, max_features using GridCV.
Hopefully, these will increase the performance of the model.
Let me know how is it performing with these.
I have a training feature set consisting of 92 features. Out of which 91 features are boolean values of 1 or 0. But 1 feature is numerical and it varies from 3-2000.
Will it be better if I do feature scaling on my 92nd feature?
If yes, what are the best possible ways to do it? I am using Python.
Sometimes, It is highly dependent on which algorithm you wanna use for your prediction. Suppose if you are using SVM and using Gaussian Kernel for that and you are not using feature scaling on your inputs then you might end up with wrong hypothesis and your large features will dominates over the other smaller features. Generally, feature scaling are always the best ways to control the variations in input and also it makes algorithm to compute fast (or in other words converge to the optimal minima).