I have a case where I want to predict columns H1 and H2, which are continuous, from features that are all categorical, in the hope of finding the combination of features that gives optimal values for H1 and H2. However, the distribution of the categories is uneven; some categories appear only once.
Here's my data:
And here is the category frequency information for each column:
What I want to ask:
Does the imbalance in the categorical features greatly affect the predictions? What is the right way to deal with this problem?
How do you find the optimal combination? Do you have to run a simulation that predicts every possible combination of features with the trained model?
What analytical technique is appropriate for determining the relationship between the features and H1/H2? So far I have been converting the categorical data with one-hot encoding and then computing a correlation map.
What ML model can be applied to my case? Until now I have tried RF, KNN, and SVR models, but the RMSE is still high.
What keywords describe similar cases and could help me search for articles on Google? This is my first time working on an ML/DS case for a paper.
Thank you very much.
A prediction based on a single observation won't be too reliable, of course. Binning rare categories into a sort of 'other' category is one common approach.
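For example, a minimal pandas sketch (assuming a DataFrame df and a placeholder column name "C1") that folds rare levels into an "other" bucket:

import pandas as pd

def bin_rare_categories(df, column, min_count=5):
    # Levels observed fewer than min_count times are replaced with "other".
    counts = df[column].value_counts()
    rare = counts[counts < min_count].index
    return df[column].where(~df[column].isin(rare), other="other")

# df["C1"] = bin_rare_categories(df, "C1", min_count=5)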
Feature selection is a vast topic (g: filter methods, embedded methods, wrapper methods). Personally I prefer studying mutual information and variance inflation factor first.
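As a rough illustration of the mutual-information part (assuming df holds the raw categorical features plus the targets H1 and H2; column names are placeholders):

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

X = pd.get_dummies(df.drop(columns=["H1", "H2"]))  # one-hot encode the categorical features
mi = mutual_info_regression(X, df["H1"], discrete_features=True, random_state=0)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_scores.head(10))  # features carrying the most information about H1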
We cannot rely on Pearson's correlation when talking about categorical or binary features. The basic approach would be grouping your dataset by categories and comparing the target distributions for each one, running statistical tests perhaps to check whether the difference is significant. Also g: ANOVA, Kendall rank.
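A quick sketch of the group-and-compare idea with a one-way ANOVA from scipy (assuming a categorical column "C1" and target "H1"; the test is unreliable for levels with only a single observation):

from scipy import stats

print(df.groupby("C1")["H1"].describe())           # per-level target summary for a visual check
groups = [g["H1"].values for _, g in df.groupby("C1")]
f_stat, p_value = stats.f_oneway(*groups)          # does the mean of H1 differ across levels?
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")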
That said, preprocessing your data to get rid of useless or redundant features often yields much more improvement than using more complex models or hyperparameter tuning. Regardless, trying out gradient boosting models never hurts (catboost even provides a robust automatic handling of categorical features). ExtraTreesRegressor is less prone to overfitting than classic RF. Linear models should not be ignored either, especially ones like Lasso with embedded feature selection capability.
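A hedged sketch of trying CatBoost (with its native categorical handling) and ExtraTrees side by side, assuming X holds the raw categorical columns and y is H1:

import pandas as pd
from catboost import CatBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

cat_cols = X.select_dtypes(include="object").columns.tolist()  # categorical columns by dtype

cb = CatBoostRegressor(cat_features=cat_cols, verbose=0, random_state=0)
print("CatBoost RMSE:", -cross_val_score(cb, X, y, cv=5, scoring="neg_root_mean_squared_error").mean())

et = ExtraTreesRegressor(n_estimators=300, random_state=0)
X_ohe = pd.get_dummies(X)  # scikit-learn trees need numeric input
print("ExtraTrees RMSE:", -cross_val_score(et, X_ohe, y, cv=5, scoring="neg_root_mean_squared_error").mean())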
Related
I have a set of alphanumeric categorical features (c_1, c_2, ..., c_n) and one numeric target variable (prediction) as a pandas dataframe. Can you please suggest a feature selection algorithm that I can use for this data set?
I'm assuming you are solving a supervised learning problem like Regression or Classification.
First of all, I suggest transforming the categorical features into numeric ones using one-hot encoding. Pandas provides a useful function that already does it:
dataset = pd.get_dummies(dataset, columns=['feature-1', 'feature-2', ...])
If you have a limited number of features and a model that is not too computationally expensive, you can test every possible combination of features. It is the most thorough approach, but it is seldom a viable option.
A possible alternative is to sort all the features by their correlation with the target, then sequentially add them to the model, measure the model's performance, and select the set of features that performs best.
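A minimal sketch of that forward procedure, assuming X is the one-hot encoded feature matrix (a DataFrame) and y the numeric target:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Rank features by absolute correlation with the target.
ranking = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1])).sort_values(ascending=False)

best_score, best_features, selected = -np.inf, [], []
for feature in ranking.index:
    selected.append(feature)
    score = cross_val_score(LinearRegression(), X[selected], y, cv=5, scoring="r2").mean()
    if score > best_score:  # keep the prefix of features that scored best so far
        best_score, best_features = score, list(selected)

print(best_features, best_score)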
If you have high-dimensional data, you can consider reducing the dimensionality with PCA or another dimensionality reduction technique. It projects the data into a lower-dimensional space, reducing the number of features; obviously you will lose some information due to the approximation.
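For instance, a small pipeline sketch (the Ridge regressor is just a placeholder) that keeps enough components to explain roughly 95% of the variance:

from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), PCA(n_components=0.95), Ridge())
model.fit(X, y)
print("components kept:", model.named_steps["pca"].n_components_)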
These are only some examples of methods to perform feature selection, there are many others.
Final tips:
Remember to split the data into Training, Validation and Test sets.
Often data normalization is recommended to obtain better results.
Some models have embedded mechanisms for feature selection (Lasso, Decision Trees, ...).
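Putting the tips together, a minimal sketch (a feature matrix X and target y are assumed, with roughly a 60/20/20 split):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

scaler = StandardScaler().fit(X_train)            # normalize using training data only
lasso = Lasso(alpha=0.1).fit(scaler.transform(X_train), y_train)

kept = np.flatnonzero(lasso.coef_)                # coefficients shrunk to zero = dropped features
print("features kept:", len(kept), "validation R2:", lasso.score(scaler.transform(X_val), y_val))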
I have 2 doubts:
If we have a classification problem with a dataframe that has a large number of features (> 100 columns), and say 20-30 of them are highly correlated, and the target column (y) is very skewed towards one class:
should we first remove the imbalance using Imblearn, or should we drop the highly correlated columns?
In a classification problem, should we first standardise the data or handle the outliers?
There is no "true" answer to your questions - the approach to take highly depends on your setting, the models you apply and the goals at hand.
The topic of class imbalance has been discussed elsewhere (for example here and here).
A valid reason for oversampling/undersampling your positive or negative class training examples could be the knowledge that the true incidence of positive instances is higher (lower) than your training data suggests. Then you might want to apply sampling techniques to achieve a positive/negative class balance that matches that prior knowledge.
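A small imbalanced-learn sketch of that idea (binary case; the assumed prior of ~30% positives is just an example, and only the training split is resampled):

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy=0.3 / 0.7, random_state=0)  # minority/majority ratio after resampling
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)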
While not really dealing with imbalance in your label distribution, your specific setting may warrant assigning different costs to false positives and false negatives (e.g. the cost of misclassifying a cancer patient as healthy may be higher than vice versa). You can deal with this by e.g. adapting your cost function (e.g. a false negative incurring a higher cost than a false positive) or by performing some kind of threshold optimization after training (e.g. to reach a certain precision/recall in cross-validation).
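A hedged sketch of the threshold-optimization route, assuming a fitted classifier clf and a held-out validation set (maximizing F1 here, but any cost-weighted criterion works):

import numpy as np
from sklearn.metrics import precision_recall_curve

proba = clf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]   # the last precision/recall pair has no threshold
print("chosen threshold:", best_threshold)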
The problem of highly correlated features occurs with models that assume there is no correlation between features. For example, if you have an issue with multicollinearity in your feature space, parameter estimates in logistic regressions may be off. Whether or not there is multicollinearity you can check, for example, using the variance inflation factor (VIF). However, not all models carry such an assumption, so you might be safe disregarding the issue depending on your setting.
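For example, a quick VIF check with statsmodels (assuming X is the numeric feature DataFrame; values above roughly 5-10 are a common rule of thumb for trouble):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_num = X.astype(float)
vif = pd.Series(
    [variance_inflation_factor(X_num.values, i) for i in range(X_num.shape[1])],
    index=X_num.columns,
).sort_values(ascending=False)
print(vif.head(10))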
The same goes for standardisation: this may not be necessary (e.g. for tree classifiers), but other methods may require it (e.g. PCA).
Whether or not to handle outliers is a difficult question. First you would have to define what an outlier is - are they e.g. a result of human error? Do you expect seeing similar instances out in the wild? If you can establish that your model performs better if you train it with outliers removed (on a holdout validation or test set), then: sure, go for it. But keep potential outliers in for validation if you plan to apply your model on streams of data which may produce similar outliers.
I'm trying to find ways to improve performance of machine learning models either binary classification, regression or multinomial classification.
I'm now looking at the topic of categorical variables and trying to combine low-occurring levels together. Let's say a categorical variable has 10 levels, where 5 levels account for 85% of the total frequency count and the remaining 5 levels account for the other 15%.
I'm currently trying different thresholds (30%, 20%, 10%) to combine levels together. This means I combine together the levels which represent either 30%, 20% or 10% of the remaining counts.
I was wondering if grouping these "low frequency groups" into a new level called "others" would have any benefit in improving the performance.
I further use a random forest for feature selection, and I know that having fewer levels than originally may cause a loss of information and therefore not improve my performance.
Also, I tried discretizing numeric variables but noticed that my performance was weaker, because random forests benefit from having the ability to split on their preferred split point rather than being forced to split on an engineered split point that I would have created by discretizing.
In your experience, would grouping low-occurring levels together have a positive impact on performance? If yes, would you recommend any techniques?
Thank you for your help!
This isn't a programming question... By having fewer classes, you inherently increase the chance of randomly predicting the correct class.
Consider a stacked model (two models) where you have a primary model to classify between the overrepresented classes and the 'other' class, and then have a secondary model to classify between the classes within the 'other' class if the primary model predicts the 'other' class.
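A rough sketch of that two-stage idea (random forests are just placeholders; the frequency threshold and names are assumptions):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

counts = y_train.value_counts()
rare_levels = counts[counts < 30].index                       # classes with fewer than 30 samples
y_primary = y_train.where(~y_train.isin(rare_levels), other="other")

primary = RandomForestClassifier(random_state=0).fit(X_train, y_primary)

mask = y_train.isin(rare_levels)
secondary = RandomForestClassifier(random_state=0).fit(X_train[mask], y_train[mask])

def predict(X_new):
    pred = pd.Series(primary.predict(X_new), index=X_new.index)
    is_other = pred == "other"
    if is_other.any():                                         # resolve "other" with the secondary model
        pred[is_other] = secondary.predict(X_new[is_other])
    return pred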
I have encountered a similar question, and this is what I have managed to find so far:
Try numerical feature encoding on high cardinality categorical features.
There is a very good library for that: category_encoders, and also a Medium article describing some of the methods. A small sketch follows the sources below.
Sources
https://www.kaggle.com/general/16927
https://stats.stackexchange.com/questions/411767/encoding-of-categorical-variables-with-high-cardinality
https://datascience.stackexchange.com/questions/37819/ml-models-how-to-handle-categorical-feature-with-over-1000-unique-values?newreg=f16addec41734d30bb12a845a32fe132
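A minimal target-encoding sketch with category_encoders (the column name is a placeholder; fit on the training split only to avoid target leakage):

import category_encoders as ce

encoder = ce.TargetEncoder(cols=["high_card_feature"])
X_train_enc = encoder.fit_transform(X_train, y_train)
X_test_enc = encoder.transform(X_test)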
Cut the low-frequency levels, as you proposed initially. The threshold is arbitrary and is usually 1-10% of the training samples. scikit-learn has an option for this in its OneHotEncoder transformer (a sketch follows the sources below).
Sources
https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b
https://www.kaggle.com/general/16927
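A small sketch of that option (assuming scikit-learn >= 1.1; levels seen in fewer than 1% of training rows are collapsed into one "infrequent" column):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(min_frequency=0.01, handle_unknown="infrequent_if_exist")
X_train_enc = ohe.fit_transform(X_train[["high_card_feature"]])  # placeholder column name
print(ohe.get_feature_names_out())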
I have an ML language identification project (Python) that requires a multi-class classification model with high-dimensional feature input.
Currently, all I can do to improve accuracy is trial and error: mindlessly combining available feature extraction algorithms and ML models and seeing if I get lucky.
I am asking if there is a commonly accepted workflow that finds an ML solution systematically.
This thought might be naive, but I am wondering if I can somehow visualize that high-dimensional data and the decision boundaries of my model. Hopefully this visualization can help me do some tuning. In MATLAB, after training, I can choose any two features and MATLAB will plot a decision boundary accordingly. Can I do this in Python?
Also, I am looking for some types of graphs that I can use in the presentation to introduce my model and features. What are the most common graphs used in the field?
Thank you
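On the visualization part of the question: scikit-learn (>= 1.1) can plot a 2-D decision boundary much like MATLAB does, if the model is refit on the two chosen features. A rough sketch with placeholder column names:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC

X2 = X_train[["feature_a", "feature_b"]].values      # the two features to inspect
clf2 = SVC().fit(X2, y_train)                        # refit on just these two features

disp = DecisionBoundaryDisplay.from_estimator(clf2, X2, response_method="predict", alpha=0.4)
disp.ax_.scatter(X2[:, 0], X2[:, 1], c=pd.factorize(y_train)[0], edgecolor="k")
plt.show()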
Feature engineering is more of an art than a technique. It might require domain knowledge, or you could try adding, subtracting, dividing, and multiplying different columns to create new features and check whether they add value to the model. If you are using Linear Regression, the adjusted R-squared value should increase; in tree models, you can look at the feature importances, etc.
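For instance, a hedged sketch of checking whether an engineered column earns its keep via impurity-based importances (column names are placeholders):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X["ratio_a_b"] = X["col_a"] / (X["col_b"] + 1e-9)    # hypothetical engineered feature
rf = RandomForestClassifier(random_state=0).fit(X, y)
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))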
I am wondering whether there exists some correlation among the hyperparameters of two different classifiers.
For example: let us say we run LogisticRegression on a dataset with the best hyperparameters (found through GridSearch) and want to run another classifier like SVC (SVM classifier) on the same dataset. Instead of finding all hyperparameters using GridSearch, can we fix some hyperparameter values (or reduce their ranges to limit the GridSearch space)?
As an experiment, I used scikit-learn classifiers like LogisticRegression, SVC, LinearSVC, SGDClassifier, and Perceptron to classify some well-known datasets. In some cases, I am able to see some correlation empirically, but not always and not for all datasets.
So please help me clear up this point.
I don't think you can correlate the parameters of different classifiers like this. This is mainly because each classifier behaves differently: it has its own way of fitting the data according to its own set of equations. For example, take the case of SVC with two different kernels, rbf and sigmoid. It might be that rbf fits the data perfectly with the regularization parameter C set to, say, 0.001, while the sigmoid kernel over the same data may fit with a C value of 0.00001. The two values may also happen to be equal, but you can never say that for sure. When you say that:
In some cases, I am able to see some correlation empirically, but not always for all datasets.
It may simply be a coincidence, since it all depends on the data and the classifiers. You cannot apply it globally. Correlation does not always equal causation.
You can visit this site and see for yourself that although different regressor functions have the same parameter a, their equations are vastly different, and hence over the same dataset you might get drastically different values of a.