Imbalanced Dataset - Binary Classification (Python)

I am trying to create a binary classification model for an imbalanced dataset (0: 84K, 1: 16K) using Random Forest. I have tried class_weight='balanced', class_weight={0: 1, 1: 5}, downsampling and oversampling, but none of these seem to work. My metrics are usually in the range below:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks

There are lots of ways to improve classifier behavior. If you think your data are balanced (or rather, that your weighting method balances them enough), then consider expanding your forest, either with deeper trees or with more trees.
Also try other methods like SVM or an ANN, and see how they compare.
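For the forest-expansion suggestion, a minimal sketch of a bigger, class-weighted forest (X_train and y_train are assumed to exist already; the values are only starting points):
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,          # more trees
    max_depth=None,            # let the trees grow deeper
    class_weight="balanced",   # reweight the minority class
    n_jobs=-1,
    random_state=42,
)
clf.fit(X_train, y_train)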

Try stratified sampling for the dataset so that a constant class ratio is maintained in both the training and the test set, and then use the balanced class weights you have already tried. If you want to improve the results further, there are plenty of other options:
1) First, be sure that the dataset being provided is accurate and verified.
2) You can play with the probability threshold: in binary classification, only make a prediction when the model is confident (say, probability > 0.7), otherwise abstain. The drawback is that many rows remain unpredicted because the algorithm is not confident enough, but for a business model this is often a good approach, because people prefer fewer false negatives. (See the sketch after this list.)
3) Use stratified sampling to divide the training and test sets so that the class ratio stays constant. Rather than a plain train_test_split, stratified sampling returns the indices for training and testing, and you can combine it with cross-validation over different iterations.
4) In the confusion matrix, have a look at the precision score per class and see which class performs worse (I believe applying the threshold limitation above would help with this).
5) Try other classifiers: Logistic Regression, SVM (linear or with another kernel, i.e. LinearSVC or SVC), Naive Bayes. In most binary classification cases Logistic Regression and SVC seem to perform ahead of other algorithms, so try these approaches first.
6) Make sure to search for the best fitting parameters, i.e. the hyperparameters, using grid search over a couple of learning rates, kernels, class weights or other parameters. If it is text classification, are you applying CountVectorizer with TF-IDF (and have you played with max_df and stop-word removal)?
If you have tried all of these and nothing helps, then reconsider the choice of algorithm itself.
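A rough sketch combining points 2) and 3), assuming X, y and a classifier clf already exist; 0.7 is just an example cut-off:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0   # stratified split keeps the class ratio
)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba > 0.7).astype(int)   # only call a row positive when the model is confident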

Related

How to use KMeans clustering to improve the accuracy of a logistic regression model?

I am a beginner in machine learning in python, and I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%. I have tried numerous ways to improve the accuracy of the model, such as one-hot encoding of categorical variables, scaling of the continuous variables, and I did a grid search to find the best parameters. They all failed to improve the accuracy. So, I looked into unsupervised learning methods in order to improve it.
I tried using KMeans clustering and set n_clusters to 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation, but with the cross-validation scored against the labels predicted by KMeans:
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
kmeans = KMeans(n_clusters=2)
kmeans.fit(X_train)
logreg = LogisticRegression().fit(X_train, y_train)
# note: this scores logreg against the cluster labels, not the true y_train
cross_val_score(logreg, X_train, kmeans.labels_, cv=5)
When using the cross_val_score, the accuracy is averaging over 95%. However, when I use the .score() method:
logreg.score(X_train, kmeans.labels_)
, the score is in the 60s. My questions are:
What is the significance (or meaning) of the score produced when testing the model against the labels predicted by k-means?
How can I use k-means clustering to improve the accuracy of the model? I tried adding a 'cluster' column that contains the clustering labels to the training data and fit the logistic regression, but it also didn't improve the score.
Why is there a huge discrepancy between the score from cross_val_score and the one from the .score() method?
I'm having a hard time understanding the context of your problem based on the snippet you provided. Good work keeping the code minimal, but in this case I feel it may have been a bit too minimal. Regardless, I'm going to read between the lines and state some relevant ideas. I'll then attempt to answer your questions more directly.
I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%
This only tells a small part of the story. Knowing what data you're classifying and its general form is pretty vital, and accuracy doesn't tell us much about how the inaccuracy is distributed across the problem.
Some natural questions:
Is one class 50% accurate and the other 100% accurate? Are both classes 75% accurate?
What is the class balance (is there more of one class than the other)?
How much overlap do these classes have?
I recommend profiling your training and test sets, and maybe running your data through t-SNE to get an idea of the class overlap in your vector space.
Such a plot will give you an idea of how much overlap your two classes have. In essence, t-SNE maps a high-dimensional X to a 2-D X while attempting to preserve proximity. You can then plot your flagged y values as color and the 2-D X values as points on a grid to get an idea of how tightly packed your classes are in high-dimensional space. In an easy classification problem each class sits on its own island; the more these islands mix together, the harder classification will be.
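A quick sketch of such a plot, assuming X_train and y_train exist and matplotlib is available (t-SNE can be slow on large sets, so you may want to subsample first):
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_train)    # high-dim X -> 2-D
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, s=5, cmap="coolwarm")  # color = class label
plt.title("t-SNE projection colored by class")
plt.show()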
did a grid search to find the best parameters
Hot take, but don't use grid search; random search is better (source: Artificial Intelligence, Jones and Bartlett). Grid search repeats too much information, wasting time re-exploring similar parameters.
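For the logistic regression here, a random search could look roughly like this (the parameter range and iteration count are only examples, and scipy's loguniform is assumed to be available):
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},   # sample C on a log scale
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)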
I tried using KMeans clustering, and I set the n_clusters into 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
So, to rephrase: you trained your model to predict an output given some input, then tested how it performed predicting on that same data and got 75%. This is called training accuracy (as opposed to validation or test accuracy). A low training accuracy is indicative of one of two things:
there's a lot of overlap between your classes. If this is the case, I would look into feature engineering. Find a vector space which better segregates the two classes.
there's not a lot of overlap, but the boundary between the two classes is complex. You need a model with more parameters to segregate your two classes.
Model complexity isn't free, though. See the curse of dimensionality and overfitting.
OK, answering your questions more directly:
These accuracy scores mean your model isn't complex enough to learn the problem, or there's too much overlap between the two classes for a better accuracy to be possible.
I wouldn't use k-means clustering to try to improve this. K-means attempts to find cluster structure based on location in a vector space, but you already have labelled data (y_train), so you already know which cluster each point should belong to. Try modifying X_train in some way to get better segregation, or try a more complex model. You can use things like k-means or t-SNE to check your transformed X_train for better segregation, but I wouldn't use them directly. Obligatory reminder that you need to test and validate with holdout data; see another answer I provided for more info.
I'd need more code to figure that one out.
p.s. welcome to stack overflow! Keep at it.

How to achieve regression model without underfitting or overfitting

I have a university project, and I was given a dataset in which almost all features have a very weak correlation with the target (only one feature has a moderate correlation). Its distribution is not normal either. I already tried a simple linear regression model, which underfit; then I applied a plain random forest regressor, which overfit; and when I tuned the random forest regressor with RandomizedSearchCV it took far too long. Is there any way to get a decent model from a not-so-good dataset without underfitting or overfitting, or is it just not possible at all?
Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Overfitting on random forests
Personally, I'd try to hack this route since you mention that your data is not strongly correlated. It's typically easier to fix overfitting than underfitting so that helps, too.
Try looking at your tree outputs. If you are using Python, scikit-learn's export_graphviz can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the minimum number of samples a node must have in order to split (or, similarly, the minimum number of samples a leaf must have).
Try increasing the number of trees in the RF. (A combined sketch follows this list.)
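A combined sketch of these knobs for a RandomForestRegressor (the exact values are only starting points to tune, and X_train/y_train are assumed to exist):
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,        # more trees
    max_depth=8,             # shallower trees
    min_samples_split=10,    # require more samples before a node may split
    min_samples_leaf=5,      # and in every leaf
    random_state=0,
)
rf.fit(X_train, y_train)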
Underfitting on linear regression
Add more parameters. If you have variables a, b, etc., adding their polynomial features, i.e. a^2, a^3, ..., b^2, b^3, etc., may help. If you add enough polynomial features you should be able to overfit, although that doesn't by itself guarantee a good fit (RMSE value) on the train set. (See the sketch after this list.)
Try plotting some of the variables against the value to predict (y). Perhaps you will be able to see a non-linear pattern (e.g. a logarithmic relationship).
Do you know anything about the data? Perhaps a variable that is the product, or the ratio, of two other variables would be a good indicator.
If you are regularizing your regression (or if the software is applying regularization automatically), try reducing the regularization parameter.
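A small sketch of the polynomial-features idea, with a little regularization as per the last point; degree and alpha are arbitrary starting values, and X_train/X_test/y_train/y_test are assumed to exist:
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

model = make_pipeline(PolynomialFeatures(degree=2),  # adds a^2, a*b, b^2, ...
                      Ridge(alpha=1.0))              # mild regularization
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                   # R^2 on held-out data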

Proper way to handle highly imbalanced data - binary classification

I have a really large dataset with 60 million rows and 11 features.
It is a highly imbalanced dataset, 20:1 (signal:background).
As I see it, there are two ways to tackle this problem:
First: under-sampling/oversampling.
I have two problems/questions with this approach.
If I under-sample before the train/test split, I am losing a lot of data.
But more importantly, if I train a model on a balanced dataset, I lose information about the frequency of my signal data (say, the frequency of benign versus malignant tumors), and because the model is both trained and evaluated on balanced data it will appear to perform well. But if I later apply the model to new data, it will perform badly, because real data is imbalanced.
If I under-sample after the train/test split, my model will underfit, because it will be trained on balanced data but validated/tested on imbalanced data.
Second: class weight penalty.
Can I use a class weight penalty for XGBoost, Random Forest and Logistic Regression?
So, everybody, I am looking for an explanation of, and ideas for, how to work on this kind of problem.
Thank you in advance, I will appreciate any of your help.
I suggest this quick paper by Breiman (author of Random Forest):
Using Random Forest to Learn Imbalanced Data
The suggested methods are weighted RF, where you compute the splits using a weighted Gini (or entropy, which in my opinion works better when weighted), and Balanced Random Forest, where you balance the classes during the bootstrap.
Both methods can be implemented also for boosted trees!
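In scikit-learn you can approximate the weighted-RF idea with the class_weight parameter; 'balanced_subsample' recomputes the weights per bootstrap sample, which is in the spirit of Balanced Random Forest (a sketch, not Breiman's exact algorithm; X_train/y_train assumed):
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    class_weight="balanced_subsample",   # weights computed for each bootstrap sample
    n_jobs=-1,
    random_state=0,
)
rf.fit(X_train, y_train)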
One suggested methodology is the Synthetic Minority Oversampling Technique (SMOTE), which attempts to balance the dataset by creating synthetic instances of the minority class; you then train any classification algorithm on the balanced dataset.
For comparing multiple models, the Area Under the ROC Curve (AUC score) can be used to determine which model is superior.
This guide will give you some ideas on different methodologies you can use and compare to resolve the imbalance problem.
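A minimal sketch with the imbalanced-learn package (SMOTE is applied only to the training split; clf stands for whichever classifier you choose, and the splits are assumed to exist):
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score

X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)  # oversample the minority class
clf.fit(X_res, y_res)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))        # compare models by AUC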
The above issue is pretty common when dealing with medical datasets and other kinds of fault detection where one of the classes (the ill effect) is under-represented.
The best way to tackle this is to generate folds and apply cross-validation, with the folds built so that the classes are balanced within each fold. In your case this creates 20 folds, each containing all of the under-represented class and a different fraction of the over-represented class.
Generating balanced folds and using cross-validation also results in a better-generalized, more robust model. In your case 20 folds might seem too harsh, so you could instead create 10 folds, each with a 2:1 class ratio (a rough sketch follows).
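A rough sketch of that fold idea, assuming NumPy arrays X and y and that class 1 is the under-represented one (this is illustrative, not a library routine):
import numpy as np

minority_X = X[y == 1]
majority_X = X[y == 0]
folds = []
for chunk in np.array_split(majority_X, 10):          # 10 folds -> roughly 2:1 ratio
    X_fold = np.vstack([minority_X, chunk])
    y_fold = np.concatenate([np.ones(len(minority_X)), np.zeros(len(chunk))])
    folds.append((X_fold, y_fold))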

How to increase true positive in your classification Machine Learning model?

I am new to Machine Learning.
I have a dataset with highly unbalanced classes (dominated by the negative class), more than 2K numeric features, and a target in [0, 1]. I have trained a logistic regression; although I am getting an accuracy of 89%, the confusion matrix shows that the model produces very few true positives. Below are the scores of my model:
Accuracy Score : 0.8965989500114129
Precision Score : 0.3333333333333333
Recall Score : 0.029545454545454545
F1 Score : 0.05427974947807933
How I can increase my True Positives? Should I be using a different classification model?
I have tried PCA and represented my data in 2 components; it increased the model accuracy to about 90%, however the number of true positives decreased again.
There are several ways to do this :
You can change your model and test whether it performs better or not
You can set a different prediction threshold: here I guess you predict 0 if the output of your regression is < 0.5; you could change that 0.5 to 0.25, for example. This would increase your true positive rate, but of course at the price of some more false positives. (See the sketch after this list.)
You can duplicate every positive example in your training set so that your classifier gets the impression that the classes are actually balanced.
You could change the loss of the classifier to penalize false negatives more heavily (this is actually pretty close to duplicating your positive examples in the dataset).
I'm sure many other tricks could apply; this is just my favorite short-list.
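A sketch of the threshold change and the loss re-weighting mentioned above, assuming a fitted classifier clf and the usual train/test splits (the 0.25 threshold and the 1:10 weights are just examples):
from sklearn.linear_model import LogisticRegression

# lower the decision threshold to catch more positives
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.25).astype(int)      # default cut-off would be 0.5

# penalize false negatives more heavily via class weights
weighted = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
weighted.fit(X_train, y_train)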
I'm assuming that your purpose is to obtain a model with good classification accuracy on some test set, regardless of the form of that model.
In that case, if you have access to the computational resources, try gradient-boosted trees. That's an ensemble classifier that builds multiple decision trees on subsets of your data and then combines their votes to make predictions. As far as I know, it can give good results with unbalanced class counts.
scikit-learn has sklearn.ensemble.GradientBoostingClassifier for this. I have not used that particular one, but I use the regression version often and it seems good. I'm pretty sure MATLAB has this as a package too, if you have access.
2K features might be difficult for the scikit-learn implementation; I don't know, I've never tried.
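A minimal sketch of that classifier (the hyperparameters are guesses to tune, not recommendations; the splits are assumed to exist):
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=200,
                                 learning_rate=0.1,
                                 max_depth=3,
                                 random_state=0)
gbc.fit(X_train, y_train)
print(gbc.score(X_test, y_test))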
What is the size of your dataset? How many rows are we talking about here?
Your dataset is not balanced, so it is quite normal for a simple classification algorithm to predict the majority class most of the time and give you an accuracy of around 90%. Can you collect more data that contains more positive examples?
Or just try oversampling/under-sampling and see if that helps.
You can also use a penalized version of the algorithm to impose a penalty whenever the wrong class is predicted. That may help.
You can try many different solutions.
If you have quite a lot of data points, for instance 2K 1s and 20K 0s, you can try dropping the extra 0s and keeping only 2K of them, then training on that. You can also try using different sets of 2K 0s with the same set of 2K 1s to train multiple models, and make the decision based on all of them (see the sketch below).
You can also try adding weights at the output layer: for instance, if you have 10 times more 0s than 1s, try multiplying the prediction value for the 1s by 10.
You could probably also try increasing dropout, if you are using a neural network.
And so on.
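A rough sketch of the "several balanced subsets" idea above, assuming NumPy arrays X_train/y_train/X_test and logistic regression as a stand-in base model:
import numpy as np
from sklearn.linear_model import LogisticRegression

pos = X_train[y_train == 1]
neg = X_train[y_train == 0]
models = []
for chunk in np.array_split(neg, 10):                  # a different slice of 0s per model
    Xb = np.vstack([pos, chunk])
    yb = np.concatenate([np.ones(len(pos)), np.zeros(len(chunk))])
    models.append(LogisticRegression(max_iter=1000).fit(Xb, yb))

# average the predicted probabilities over the small ensemble
avg_proba = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)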

Can you fix the false negative rate in a classifier in scikit learn

I am using a Random Forest classifier in scikit-learn with an imbalanced dataset of two classes. I am much more worried about false negatives than false positives. Is it possible to fix the false negative rate (at, say, 1%) and ask scikit-learn to optimize the false positive rate somehow?
If this classifier doesn't support it, is there another classifier that does?
I believe the problem of class imbalance in sklearn can be partially resolved by using the class_weight parameter.
This parameter is either a dictionary in which each class is assigned a weight, or a string that tells sklearn how to build this dictionary. For instance, setting this parameter to 'balanced' ('auto' in older versions) will weight each class in inverse proportion to its frequency.
By giving a higher weight to the class that is less present, you can end up with 'better' results.
Classifiers like SVM or logistic regression also offer this class_weight parameter.
This Stack Overflow answer gives some other ideas on how to handle class imbalance, like under-sampling and oversampling.
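A quick sketch of both forms of class_weight (the dict values are only an example weighting; X_train/y_train are assumed):
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(class_weight="balanced")        # inverse-frequency weights
# clf = RandomForestClassifier(class_weight={0: 1, 1: 10})   # or explicit per-class weights
clf.fit(X_train, y_train)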
I found this article on the class imbalance problem:
http://www.chioka.in/class-imbalance-problem/
It basically discusses the following possible solutions, to summarize:
Cost function based approaches
Sampling based approaches
SMOTE (Synthetic Minority Over-Sampling Technique)
Recent approaches: RUSBoost, SMOTEBagging and Underbagging
Hope it helps.
Random forest is already a bagged classifier, so that should already give some good results.
One typical way of reaching a desired false positive or false negative rate is to analyze the classifier using ROC curves:
http://scikit-learn.org/stable/auto_examples/plot_roc.html
and then to modify the decision threshold, or certain parameters, to achieve the desired FP rate, for example (see the sketch below).
I am not sure whether the random forest classifier's FP rate can be tuned directly through its parameters; you can look at other classifiers depending on your application.
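A sketch of using the ROC curve to pick a probability threshold that keeps the false negative rate near 1% (i.e. recall of at least 99%), assuming a fitted clf and a test split:
import numpy as np
from sklearn.metrics import roc_curve

proba = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
idx = np.argmax(tpr >= 0.99)                    # first threshold reaching 99% recall
print("threshold:", thresholds[idx], "FPR:", fpr[idx])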
