I have a dataset where the classes are unbalanced. The classes are either '1' or '0' where the ratio of class '1':'0' is 5:1. How do you calculate the prediction error for each class and the rebalance weights accordingly in sklearn with Random Forest, kind of like in the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
You can pass sample weights argument to Random Forest fit method
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits
that would create child nodes with net zero or negative weight are
ignored while searching for a split in each node. In the case of
classification, splits are also ignored if they would result in any
single class carrying a negative weight in either child node.
In older version there were a preprocessing.balance_weights method to generate balance weights for given samples, such that classes become uniformly distributed. It is still there, in internal but still usable preprocessing._weights module, but is deprecated and will be removed in future versions. Don't know exact reasons for this.
Update
Some clarification, as you seems to be confused. sample_weight usage is straightforward, once you remember that its purpose is to balance target classes in training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_wight), and each element of sample witght 1-d array represent weight for a corresponding (observation, label) pair. For your case, if 1 class is represented 5 times as 0 class is, and you balance classes distributions, you could use simple
sample_weight = np.array([5 if i == 0 else 1 for i in y])
assigning weight of 5 to all 0 instances and weight of 1 to all 1 instances. See link above for a bit more crafty balance_weights weights evaluation function.
This is really a shame that sklearn's "fit" method does not allow specifying a performance measure to be optimized. No one around seem to understand or question or be interested in what's actually going on when one calls fit method on data sample when solving a classification task.
We (users of the scikit learn package) are silently left with suggestion to indirectly use crossvalidated grid search with specific scoring method suitable for unbalanced datasets in hope to stumble upon a parameters/metaparameters set which produces appropriate AUC or F1 score.
But think about it: looks like "fit" method called under the hood each time always optimizes accuracy. So in end effect, if we aim to maximize F1 score, GridSearchCV gives us "model with best F1 from all modesl with best accuracy". Is that not silly? Would not it be better to directly optimize model's parameters for maximal F1 score?
Remember old good Matlab ANNs package, where you can set desired performance metric to RMSE, MAE, and whatever you want given that gradient calculating algo is defined. Why is choosing of performance metric silently omitted from sklearn?
At least, why there is no simple option to assign class instances weights automatically to remedy unbalanced datasets issues? Why do we have to calculate wights manually? Besides, in many machine learning books/articles I saw authors praising sklearn's manual as awesome if not the best sources of information on topic. No, really? Why is unbalanced datasets problem (which is obviously of utter importance to data scientists) not even covered nowhere in the docs then?
I address these questions to contributors of sklearn, should they read this. Or anyone knowing reasons for doing that welcome to comment and clear things out.
UPDATE
Since scikit-learn 0.17, there is class_weight='balanced' option which you can pass at least to some classifiers:
The “balanced” mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as n_samples / (n_classes * np.bincount(y)).
Use the parameter class_weight='balanced'
From sklearn documentation: The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
If the majority class is 1, and the minority class is 0, and they are in the ratio 5:1, the sample_weight array should be:
sample_weight = np.array([5 if i == 1 else 1 for i in y])
Note that you do not invert the ratios.This also applies to class_weights. The larger number is associated with the majority class.
Related
I am trying to create a binary classification model for imbalance dataset using Random Forest - 0- 84K, 1- 16K. I have tried using class_weights = 'balanced', class_weights = {0:1, 1:5}, downsampling and oversampling but none of these seem to work. My metrics are usually in the below range:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks
there are lots of ways to improve classifier behavior. If you think your data are balanced (or rather, your weight method balances them enough), then consider expanding your forest, either with deeper trees or more numerous trees.
Try other methods like SVM, or ANN, and see how they compare.
Try Stratified sampling for the dataset so that you can get the constant ration being taken in account for both the test and the training dataset. And then use the class weight balanced which you have already used. If you want the accuraccy improved there are tons other ways.
1) First be sure that the dataset being provided is accurate or verified.
2) You can increase the accuracy by playing with threshold of the probability (if in binary classification if its >0.7 confident then do a prediction else wise don't , the draw back in this approach would be NULL values or mostly being not predicting as algorithm is not confident enough, but for a business model its a good approach because people prefer less False Negatives in their model.
3) Use Stratified Sampling to equally divide the training and the testing dataset, so that constant ration is being divided. rather than train_test_splitting : stratified sampling will return you the indexes for training and testing . You can play with the (cross_validation : different iteration)
4) For the confusion matrix, have a look at the precision score per class and see which class is showing more( I believe if you apply threshold limitation it would solve the problem for this.
5) Try other classifiers , Logistic, SVM(linear or with other kernel) : LinearSVC or SVC , NaiveBayes. As per seen in most cases of Binary classification Logistc and SVC seems to be performing ahead of other algorithms. Although try these approach first.
6) Make sure to check the best parameters for the fitting such as choice of Hyper Parameters (using Gridsearch with couple of learning rates or different kernels or class weights or other parameters). If its textual classification are you applying CountVectorizer with TFIDF (and have you played with max_df and stop_words removal) ?
If you have tried these, then possibly be sure about the algorithm first.
I'm trying to solve a binary classification problem where 80% of the data belongs to class x and 20% of the data belongs to class y. All my models (AdaBoost, Neural Networks and SVC) just predict all data to be part of class x as this is the highest accuracy they can achieve.
My goal is to achieve a higher precision for all entries of class x and I don't care how many entries are falsely classified to be part of class y.
My idea would be to just put entries in class x when the model is super sure about them and put them in class y otherwise.
How would I achieve this? Is there a way to move the treshold so that only very obvious entries are classified as class x?
I'm using python and sklearn
Sample Code:
adaboost = AdaBoostClassifier(random_state=1)
adaboost.fit(X_train, y_train)
adaboost_prediction = adaboost.predict(X_test)
confusion_matrix(adaboost_prediction,y_test) outputs:
array([[ 0, 0],
[10845, 51591]])
Using AdaBoostClassifier you can output class probabilities and then threshold them by using predict_proba instead of predict:
adaboost = AdaBoostClassifier(random_state=1)
adaboost.fit(X_train, y_train)
adaboost_probs = adaboost.predict_proba(X_test)
threshold = 0.8 # for example
thresholded_adaboost_prediction = adaboost_probs > threshold
Using this approach you could also inspect (just debug print, or maybe sort and plot on a graph) how the confidence levels vary in your final model on the test data to help decide whether it is worth taking further.
There is more than one way to approach your problem though. For example see Miriam Farber's answer which looks at re-weighting the classifier to adjust for your 80/20 class imbalance during training. You might find you have other problems, including perhaps the classifiers you are using cannot realistically separate x and y classes given your current data. Going through all possibilities of a data problem like this might take a few different approaches.
If you have more questions about issues with your data problem as opposed to the code, there are Stack Exchange sites that could help you as well as Stack Overflow (do read the site guidelines before posting): Data Science and Cross Validated.
In SVM, one way to move the threshold is to choose class_weight in such a way that you put much more weight on data points from class y. Consider the below example, taken from SVM: Separating hyperplane for unbalanced classes:
The straight line is the decision boundary that you get when you use SVC with default class weights (same weight for every class). The dashed line is the decision boundary that you get when you use class_weight={1: 10} (that is, put much more weight on class 1, relatively to class 0).
Class weights besically adjust the penalty parameter in SVM:
class_weight : {dict, ‘balanced’}, optional
Set the parameter C of class i to class_weight[i]*C for SVC. If not
given, all classes are supposed to have weight one. The “balanced”
mode uses the values of y to automatically adjust weights inversely
proportional to class frequencies in the input data as n_samples /
(n_classes * np.bincount(y))
I've been using python to experiment with sklearn's BayesianGaussianMixture (and with GaussianMixture, which shows the same issue).
I fit the model with a number of items drawn from a distribution, then tested the model with a held out data set (some from the distribution, some outside it).
Something like:
X_train = ... # 70x321 matrix
X_in = ... # 20x321 matrix of held out data points from X
X_out = ... # 20x321 matrix of data points drawn from a different distribution
model = BayesianGaussianMixture(n_components=1)
model.fit(X_train)
print(model.score_samples(X_in).mean())
print(model.score_samples(X_out).mean())
outputs:
-1334380148.57
-2953544628.45
The score_samples method returns a per-sample log likelihood of the given data, and "in" samples are much more likely than the "out" samples as expected - I'm just wondering why the absolute values are so high?
The documentation for score_samples states "Compute the weighted log probabilities for each sample" - but I'm unclear what the weights are based on.
Do I need to scale my input first? Is my input dimensionality too high? Do I need to do some additional parameter tuning? Or am I just misunderstanding what the method returns?
The weights are based on the mixture weights.
Do I need to scale my input first?
This is usually not a bad idea but I can't say not knowing more about your data.
Is my input dimensionality too high?
It seems given the amount of data you are fitting it actually is too high. Remember the curse of dimensionality. You have very few rows of data and 312 features, 1:4 ratio; that's not really going to work in practice.
Do I need to do some additional parameter tuning? Or am I just
misunderstanding what the method returns?
Your outputs are log-probabilites that are very negative. If you raise e to such a large negative magnitude you get a probability that is very close to zero. Your results actually make sense from that perspective. You may want to check the log-probability in areas where you know there is a higher probability of being in that component. You may also want to check the covariances for each component to make sure you don't have a degenerate solution, which is quite likely given the amount of data and dimensionality in this case. Before any of that, you may want to get more data or see if you can reduce the number of dimensions.
I forgot to mention a rather important point: The output is for the Density so keep that in mind too.
I have class imbalance problem and want to solve this using cost sensitive learning.
under sample and over sample
give weights to class to use a modified loss function
Question
Scikit learn has 2 options called class weights and sample weights. Is sample weight actually doing option 2) and class weight options 1). Is option 2) the the recommended way of handling class imbalance.
It's similar concepts, but with sample_weights you can force estimator to pay more attention on some samples, and with class_weights you can force estimator to learn with attention to some particular class. sample_weight=0 or class_weight=0 basically means that estimator doesn't need to take into consideration such samples/classes in learning process at all. Thus classifier (for example) will never predict some class if class_weight = 0 for this class. If some sample_weight/class_weight bigger than sample_weight/class_weight on other samples/classes - estimator will try to minimize error on that samples/classes in the first place. You can use user-defined sample_weights and class_weights simultaneously.
If you want to undersample/oversample your training set with simple cloning/removing - this will be equal to increasing/decreasing of corresponding sample_weights/class_weights.
In more complex cases you can also try artificially generate samples, with techniques like SMOTE.
sample_weight and class_weight have a similar function, that is to make your estimator pay more attention to some samples.
Actual sample weights will be sample_weight * weights from class_weight.
This serves the same purpose as under/oversampling but the behavior is likely to be different: say you have an algorithm that randomly picks samples (like in random forests), it matters whether you oversampled or not.
To sum it up:
class_weight and sample_weight both do 2), option 2) is one way to handle class imbalance. I don't know of an universally recommended way, I would try 1), 2) and 1) + 2) on your specific problem to see what works best.
I'm using scikit learn to perform cross validation using StratifiedKFold to compute the f1 score, but it says that some of my labels have the sum of true positives and false positives are equal to zero for some labels. I thought using StratifiedKFold should prevent this? Why am I getting this problem?
Also, is there a way to get the confusion matrix from the cross_val_score function?
Your classifier is probably classifying all data points as negative, so there are no positives. You can check that is the case by looking at the confusion matrix (docs and example here). It's hard to tell what is happening without information about your data and choice of classifier, but common causes include:
bug in your code. Check your training data contains negative data points, and that these data points contain non-zero features.
inappropriate classifier parameters. If using Naive Bayes, check your class biases. If using SVM, try using grid search over parameter values.
The sklearn classification_report function may come in handy (docs).
Re your second question: stratification ensures that each fold contains roughly the same proportion of data points from all classes. This does not mean your classifier will perform sensibly.
Update:
In a classification task (and especially when class imbalance is present) you are trading off precision for recall. Depending on your application, you can set your classifier so it does well most of the time (i.e. high accuracy) or so that it can detect the few points that you care about (i.e. high recall of the smaller classes). For example, if the task is to forward support emails to the right department, you want high accuracy. It is somewhat acceptable to misclassify the kind of email you get once a year, because you only upset one person. If your task is to detect posts by sexual predators on a children's forum, you definitely do not want to miss any of them, even if the price is that a few posts will get incorrectly flagged. Bottom line: you should optimise for your application.
Are you micro or macro averaging recall? In the former case, more weight will be given to the frequent classes (which is similar to optimising for accuracy), and in the latter all classes will have the same weight.