I have a model trained using LightGBM (LGBMRegressor), in Python, with scikit-learn.
On a weekly basis the model is re-trained, and an updated set of chosen features and their associated feature_importances_ are plotted. I want to compare these magnitudes across weeks to detect (abrupt) changes in the set of chosen variables and in the importance of each of them. But I fear the raw importances are not directly comparable (I am using the default "split" option, which scores a feature by the number of times it is used for splitting across the trees).
My question is: assuming these raw importances are not directly comparable across weeks, is a normalization enough to allow the comparison? If so, what is the best way to do such a normalization? (Maybe just a division by the highest importance of the week.)
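One simple sketch of the idea, assuming the raw split counts mainly scale with the number of trees: divide each week's importances by their sum, so every week becomes a vector of shares that sums to 1 and can be compared week over week. The arrays below are made-up stand-ins for two weeks' feature_importances_:

```python
import numpy as np

def normalize_importances(raw):
    """Convert raw split counts into shares that sum to 1."""
    raw = np.asarray(raw, dtype=float)
    return raw / raw.sum()

# Hypothetical feature_importances_ from two weekly retrains; week 2
# simply trained more trees, so every raw count is inflated 2.5x.
week1 = np.array([120, 60, 20])
week2 = np.array([300, 150, 50])

shares1 = normalize_importances(week1)
shares2 = normalize_importances(week2)
# the shares coincide, so no spurious "change" is flagged
```

Switching to importance_type="gain" in LGBMRegressor is also worth considering, since gain-based scores reflect loss reduction rather than split counts; they still benefit from the same per-week normalization before comparison.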
Thank you very much
I have a case where I want to predict columns H1 and H2, which are continuous, from all-categorical features, in the hope of finding a combination of features that gives optimal results for H1 and H2. However, the distribution of the categories is uneven; some categories occur only once.
Here's my data:
and here is the frequency of each category in each column:
what I want to ask:
Does the imbalance in the categorical features greatly affect the predictions? What is the right way to deal with this problem?
How do you find the optimal combination? Do you have to run a simulation that predicts every combination of features with the trained model?
What analytical technique is appropriate to determine the relationship between the features and H1 and H2? So far I'm converting the category data using one-hot encoding and then computing a correlation map.
What ML models can be applied to my case? So far I have tried RF, KNN, and SVR, but the RMSE is still high.
What keywords describe similar cases and could help me search for articles on Google? This is my first time working on an ML/DS case for a paper.
thank you very much
A prediction based on a single observation won't be very reliable, of course. Binning rare categories into a sort of 'other' category is one common approach.
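A minimal pandas sketch of that binning idea (the threshold and the 'other' label are arbitrary choices):

```python
import pandas as pd

def bin_rare(series, min_count=2, other_label="other"):
    """Replace categories seen fewer than min_count times with a pooled label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

s = pd.Series(["a", "a", "a", "b", "b", "c"])
binned = bin_rare(s, min_count=2)
# "c" appears only once, so it is pooled into "other"
```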
Feature selection is a vast topic (search: filter methods, embedded methods, wrapper methods). Personally I prefer to study mutual information and the variance inflation factor first.
We cannot rely on Pearson's correlation when talking about categorical or binary features. The basic approach would be to group your dataset by category and compare the target distributions for each group, perhaps running statistical tests to check whether the differences are significant. Also search: ANOVA, Kendall rank.
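A small sketch of that group-and-test idea using SciPy's one-way ANOVA on synthetic data (the group names and effect sizes are made up; group "C" is deliberately shifted):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Target values grouped by a hypothetical 3-level categorical feature.
groups = {
    "A": rng.normal(10.0, 1.0, 30),
    "B": rng.normal(10.1, 1.0, 30),
    "C": rng.normal(14.0, 1.0, 30),  # clearly shifted group
}
stat, p = f_oneway(*groups.values())
# a small p-value suggests at least one category's target mean differs
```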
That said, preprocessing your data to get rid of useless or redundant features often yields much more improvement than using more complex models or hyperparameter tuning. Regardless, trying out gradient boosting models never hurts (CatBoost even provides robust automatic handling of categorical features). ExtraTreesRegressor is less prone to overfitting than classic RF. Linear models should not be ignored either, especially ones like Lasso with embedded feature-selection capability.
I have a dataset with a large number of predictive variables and I want to use them to predict a number of output variables. However, some of the things I want to predict are categorical, and others are continuous; the things I want to predict are not independent. Is it possible with scikit-learn to, for example, mix a classifier and a regressor so that I can predict and disentangle these variables? (I'm currently looking at gradient boosting classifiers/regressors, but there may be better options.)
You can certainly use one-hot encoding or dummy-variable encoding to convert labels to numerics. See the link below for details.
https://codefires.com/how-convert-categorical-data-numerical-data-python/
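A minimal example of the one-hot idea using pandas (the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# One 0/1 indicator column per category: color_blue, color_red.
encoded = pd.get_dummies(df, columns=["color"]).astype(int)
```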
As an aside, Random Forest is a popular machine learning model that is commonly used for classification tasks, as can be seen in many academic papers, Kaggle competitions, and blog posts. In addition to classification, Random Forests can also be used for regression tasks.
A Random Forest's nonlinear nature can give it a leg up over linear algorithms, making it a great option. However, it is important to know your data and keep in mind that a Random Forest can't extrapolate: it can only make a prediction that is an average of previously observed labels. In this sense it is very similar to KNN. In other words, in a regression problem, the range of predictions a Random Forest can make is bound by the highest and lowest labels in the training data.
This behavior becomes problematic in situations where the training and prediction inputs differ in their range and/or distribution. This is called covariate shift, and it is difficult for most models to handle, but especially for Random Forest, because it can't extrapolate.
https://towardsdatascience.com/a-limitation-of-random-forest-regression-db8ed7419e9f
https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn
In closing, scikit-learn uses NumPy arrays as inputs to its models, so all features become de facto numerical (if you have categorical features you'll need to convert them to numerical).
I don't think there's a builtin way. There are ClassifierChain and RegressorChain that allow you to use earlier predictions as features in later predictions, but as the names indicate they assume either classification or regression. Two options come to mind:
Manually patch those together for what you want to do. For example, use a ClassifierChain to predict each of your categorical targets using just the independent features, then add those predictions to the dataset before training a RegressorChain with the numeric targets.
Use those classes as a base for defining a custom estimator. In that case you'll probably look mostly at their common parent class _BaseChain. Unfortunately that also uses a single estimator attribute, whereas you'd need (at least) two, one classifier and one regressor.
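A rough sketch of the first option on synthetic data (the estimators, target construction, and sizes are arbitrary placeholders, not a definitive recipe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.multioutput import ClassifierChain, RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y_cat = (X[:, :2] > 0).astype(int)                    # two binary targets
Y_num = X[:, 2:4] + 0.1 * rng.normal(size=(200, 2))   # two numeric targets

# Step 1: predict the categorical targets from the original features.
clf_chain = ClassifierChain(LogisticRegression()).fit(X, Y_cat)
cat_pred = clf_chain.predict(X)

# Step 2: append those predictions as extra features for the regressors.
X_aug = np.hstack([X, cat_pred])
reg_chain = RegressorChain(Ridge()).fit(X_aug, Y_num)
num_pred = reg_chain.predict(X_aug)
```

At inference time you would run the same two steps in order, feeding the classifier chain's predictions into the regressor chain's input.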
I am trying to get confidence intervals from an XGBoost model saved in a .tar.gz file, created using the Python XGBoost library.
The problem is that the model has already been fitted and I don't have the training data any more; I only have inference/serving data to predict on. All the examples I found entail using training and test data to create either quantile-regression models or bagged models, but I don't think I have the chance to do that.
Why your desired approach will not work
I assume we are talking about regression here. Given a regression model that you cannot modify, I think you will not be able to achieve your desired result using only the given model. The model was trained to calculate a continuous value that approximates some objective value (i.e., its true value) based on some given input. Nothing more.
Possible solution
The only workaround I can think of would be to train two more models whose goal is to predict the quality of the output of your given model: one would estimate the upper bound of a given (i.e., predefined by you at training time) confidence interval, and the other the lower bound. This would probably involve a lot of feature engineering; you would want to find features that correlate with the prediction quality of the original model.
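If you can gather even a modest labeled sample, the two auxiliary bound models could be sketched with scikit-learn's quantile loss (the data here is synthetic and the 90% interval is an arbitrary choice, not a prescription):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=500)

# Two auxiliary models bracketing a 90% interval around the prediction.
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

lo, hi = lower.predict(X), upper.predict(X)
coverage = np.mean((y >= lo) & (y <= hi))  # empirical coverage on this sample
```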
I am trying to understand the relationship between some independent variables and a dependent variable, and to quantify the importance of each on that dependent variable. I came across methods like the random forest that can quantify the importance of variables and then predict the outcome. However, I have an issue with the nature of the data to be used with the random forest or similar methods. An example of the data structure is provided below: as you can see, the time series have some variables, like population and Age, that do not change over time, though they differ among the cities, while other variables, such as temperature and #internet users, change over time and within the cities. My question is: how can I quantify the importance of these variables on the "Y" variable? BTW, I prefer to apply the method in a Python environment.
"How can I quantify the importance" is a very common question, also known as "feature importance".
The feature importance depends on your model; with a regression you have importance in your coefficients, while in a random forest you can use the built-in feature_importances_ (though some would not recommend it) or, better, SHAP values. Furthermore, you can use some correlation measure, e.g. Spearman/Pearson correlation between your features and your target.
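SHAP needs the separate shap package; as a scikit-learn-only sketch of the same idea, here is permutation importance on a synthetic regression (the coefficients are made up so the expected ranking is known in advance):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Feature 0 is strong, feature 1 is weak, feature 2 is pure noise.
y = 3 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)

rf = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
# result.importances_mean should rank feature 0 first, feature 2 last
```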
Unfortunately there is no "free lunch"; you will need to decide based on what you want to use it for, what your data looks like, etc.
I think the one you came across might be Boruta where you shuffle up your variables, add them to your data set and then create a threshold based on the "best shuffled variable" in a Random Forest.
My idea is as follows. Your outcome variable 'Y' has only a few possible values. You can build a classifier (Random Forest is one of many existing classifiers), to predict say 'Y in [25-94,95-105,106-150]'. You will here have three different outcomes that rule out each other. (Other interval limits than 95 and 105 are possible, if that better suits your application).
Some of your predictive variables are time series whereas others are constant, as you explain. You should use a sliding window technique where your classifier predicts 'Y' based on the time-related variables in say the month January. It doesn't matter that some variables are constant, as the actual variable 'City' has the four outcomes: '[City_1,City_2,City_3,City_4]'. Similarly, use 'Population' and 'Age_mean' as the actual variables.
Once you use classifiers, many approaches to feature ranking and feature selection have been developed. You can use a web service like insight classifiers to do it for you, or download a package like Weka for that.
Key point is that you organize your model and its predictive variables such that a classifier can learn correctly.
If city and month are also among your independent variables, you should convert them from the index into columns. If you use pandas to read your file, df.reset_index() can do the job for you.
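A tiny illustration of that reset_index step (the city/month values are made up):

```python
import pandas as pd

# A frame indexed by (City, Month), as produced by e.g. read_csv + set_index.
df = pd.DataFrame(
    {"Y": [100, 98, 103, 99]},
    index=pd.MultiIndex.from_product(
        [["City_1", "City_2"], ["Jan", "Feb"]], names=["City", "Month"]
    ),
)
flat = df.reset_index()  # City and Month become ordinary columns
```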
I'm trying to find ways to improve the performance of machine learning models, whether for binary classification, regression, or multinomial classification.
I'm now looking at categorical variables and trying to combine low-occurring levels together. Let's say a categorical variable has 10 levels, where 5 levels account for 85% of the total frequency count and the remaining 5 levels account for the other 15%.
I'm currently trying different thresholds (30%, 20%, 10%) for combining levels: I combine together the levels which represent either 30%, 20%, or 10% of the remaining counts.
I was wondering if grouping these "low frequency groups" into a new level called "others" would have any benefit in improving the performance.
I further use a random forest for feature selection, and I know that having fewer levels than originally may cause a loss of information and therefore not improve my performance.
Also, I tried discretizing numeric variables but noticed that my performance was weaker, because random forests benefit from the ability to split at their preferred split point rather than being forced to split at an engineered point that I would have created by discretizing.
In your experience, would grouping low-occurring levels together have a positive impact on performance? If yes, would you recommend any techniques?
Thank you for your help!
This isn't a programming question... By having fewer classes, you inherently increase the chance of randomly predicting the correct class.
Consider a stacked model (two models) where you have a primary model to classify between the overrepresented classes and the 'other' class, and then have a secondary model to classify between the classes within the 'other' class if the primary model predicts the 'other' class.
I have encountered a similar question, and this is what I have managed to find so far:
Try numerical feature encoding on high-cardinality categorical features.
There is a very good library for that: category_encoders. There is also a Medium article describing some of the methods.
Sources
https://www.kaggle.com/general/16927
https://stats.stackexchange.com/questions/411767/encoding-of-categorical-variables-with-high-cardinality
https://datascience.stackexchange.com/questions/37819/ml-models-how-to-handle-categorical-feature-with-over-1000-unique-values?newreg=f16addec41734d30bb12a845a32fe132
Cut features with low frequencies, as you proposed initially. The threshold is arbitrary and is usually 1-10% of the training samples. scikit-learn has an option for this in its OneHotEncoder transformer.
Sources
https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b
https://www.kaggle.com/general/16927