LightGBM 'class_weight' parameter: To use with Binary Classification or not? - python

When dealing with class imbalance issue, penalizing the majority class is a common practice that I have come across while building Machine Learning models. Hence, I often use class weights post re-sampling. LightGBM is one efficient decision tree based framework that is believed to handle class imbalance well. So I am using a LightGBM model for my binary classification problem. The dataset has high class imbalance in the ratio 34:1.
I initially used the LightGBM Classifier with 'class weights' parameter. However, the documentation of LightGBM Classifier mentions to use this parameter for multi-class problems only. For binary classification, it suggests using the 'is_unbalance' or 'scale_pos_weight' parameters. But, by using class weights I see better results and it is also easier to tune the weights and track performance of the model in comparison to when using the other two params.
But since the documentation recommends not to use it for Binary Classification, are there any repercussions of using the parameter? I am getting good results with it on my test data and validation data, but I wonder if it will behave otherwise on other real time data?

Documentation recommends alternative parameters:
Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html

Related

XGBoost for multiclassification and imbalanced data

I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below.
I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight adjustments and skews towards the majority class 0, and ignores the minority classes 1,2. Which hyperparameters other than class_weight can help me?
I tried 1) computing class weights using sklearn compute_class_weight; 2) setting weights according to the relative frequency of the classes; 3) and also manually adjusting classes with extreme values to see if any change happens at all, such as {0:0.5,1:100,2:200}. But in any case, it does not help the classifier to take the minority classes into account.
Observations:
I can handle the problem in the binary case: If I make the problem a binary classification by identifying classes [1,2], then I can get the classifier work properly by adjusting scale_pos_weight (even in this case class_weight alone does not help).
But scale_pos_weight, as far as I know, works for binary classification. Is there an analogue of this parameter for the multi-classification problems?
Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tunning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.
Remark: I know about balancing techniques, such as over/undersampling, or SMOTE. But I want to avoid them as much as possible, and prefer a solutions using hyperparameter tunning of the model if possible.
My observation above shows that this can work for the binary case.
sample_weight parameter is useful for handling imbalanced data while using XGBoost for training the data. You can compute sample weights by using compute_sample_weight() of sklearn library.
This code should work for multiclass data:
from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight(
class_weight='balanced',
y=train_df['class'] #provide your own target name
)
xgb_classifier.fit(X, y, sample_weight=sample_weights)
You can use sample_weight as #Prakash Dahal suggested, but compute your own weights. I found that different weights made a dramatic difference (I have 12 classes and very imbalanced data).
If you compute your own weights, you need to assign the relevant weight to each entry and pass the param to the classifier in the same way:
xgb_class.fit(X_train, y_train, sample_weight=weights)

One class SVM for concept drift

I am working on one class classification using python. I am currently using One class SVM as implemented in scikit-learn. My data will gradually change in due course of time and the older data will no longer be valid to be used for classification, which is known as concept drift as far as I am aware of. Hence I will have to retrain my classifier as soon as a new sample(or a predetermined number of samples) is added.
However, I have read that One class SVM does not work properly in case of concept drift. Why is it so?
Which other one class classification algorithm can I use if not OSVM in case of concept drift in my dataset?

Updating a fitted sklearn Random Forest Model

While running in production, is it possible to update a trained model with new data without re-fitting the model? I see you can use the warm_start parameter to enable adding trees to the model; however, I am looking for a way to update the existing trees with the incoming data.
As far as I can tell, this is not possible with sklearn (as they seem to implement the classical Breiman algorithm). However, you might have a look at Mondrian Forests (https://papers.nips.cc/paper/5234-mondrian-forests-efficient-online-random-forests.pdf, python implementation: https://github.com/balajiln/mondrianforest).

Incremental SVM With Probability Estimates

Is there any implementation of incremental svm which also has the feature of returning the probability of a given feature vector belonging to the various classes? Preferably usable with python code
I have heard about LaSVM. Does LaSVM has a feature of returning probability estimates? Also does it have features for handling imbalance training datasets?
You can have a look in Scikit Learn, a very flexible and efficient library written in Python
In every model, there are stored the internal calculated values. If clf is your SVM classifier, you can access clf.decision_function to see some explanation of the predictions.
It also provides a good set of tools for preprocessing data among other things you can find interesting.
cheers,
For getting probability estimate you can use scikit-learn library. There are 2 alternatives you can use. One gives probabilities. Here is an example: How to know what classes are represented in return array from predict_proba in Scikit-learn
And the other gives signed values for ranking (not probability but generally gives better result): Scikit-learn predict_proba gives wrong answers you should look at the answer.

Scikit-learn model parameters unavailable? If so what ML workbench alternative?

I am doing machine learning using scikit-learn as recommended in this question. To my surprise, it does not appear to provide access to the actual models it trains. For example, if I create an SVM, linear classifier or even a decision tree, it doesn't seem to provide a way for me to see the parameters selected for the actual trained model.
Seeing the actual model is useful if the model is being created partly to get a clearer picture of what features it is using (e.g., decision trees). Seeing the model is also a significant issue if one wants to use Python to train the model and some other code to actually implement it.
Am I missing something in scikit-learn or is there some way to get at this in scikit-learn? If not, what is the a good free machine learning workbench, not necessarily in python, in which models are transparently available?
The fitted model parameters are stored directly as attributes on the model instance. There is a specific naming convention for those fitted parameters: they all end with a trailing underscore as opposed to user-provided constructor parameters (a.k.a. hyperparameters) which don't.
The type of the fitted attributes is algorithm-dependent. For instance for a kernel Support Vector Machine you will have the arrays support vectors, dual coefs and intercepts while for random forests and extremly randomized trees you will have a collection of binary trees (internally represented in memory as contiguous numpy arrays for performance matters: structure of arrays representation).
See the Attributes section of the docstring of each model for more details, for instance for SVC:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
For tree based models you also have a helper function to generate a graphivz_export of the learned trees:
http://scikit-learn.org/stable/modules/tree.html#classification
To find the importance of features in forests models you should also have a look at the compute_importances parameter, see the following examples for instance:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#example-ensemble-plot-forest-importances-faces-py

Categories

Resources