I have two questions:
Suppose we have a classification problem with a dataframe that has a large number of features (more than 100 columns), say 20-30 of them are highly correlated, and the target column (y) is heavily skewed towards one class.
Should we first address the imbalance using imblearn, or should we first drop the highly correlated columns?
In a classification problem, should we standardise the data first or handle the outliers first?
There is no "true" answer to your questions - the approach to take highly depends on your setting, the models you apply and the goals at hand.
The topic of class imbalance has been discussed elsewhere (for example here and here).
A valid reason for oversampling/undersampling your positive or negative class training examples could be the knowledge that the true incidence of positive instances is higher (lower) than your training data suggests. Then you might want to apply sampling techniques to achieve a positive/negative class balance that matches that prior knowledge.
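As a minimal sketch of what that resampling might look like with imbalanced-learn (the synthetic dataset and the 0.5 ratio below are assumptions standing in for your own data and prior knowledge):

```python
# A minimal sketch using imbalanced-learn (imblearn); the synthetic dataset
# is only a stand-in for your own X and y.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
print("Original:", Counter(y))

# sampling_strategy is the desired minority/majority ratio; 0.5 here encodes
# a prior belief of roughly one positive for every two negatives.
X_over, y_over = RandomOverSampler(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
print("After oversampling:", Counter(y_over))

# Alternatively, shrink the majority class instead of duplicating the minority.
X_under, y_under = RandomUnderSampler(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
print("After undersampling:", Counter(y_under))
```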
While not really dealing with imbalance in your label distribution, your specific setting may warrant assigning different costs to false positives and false negatives (e.g. the cost of misclassifying a cancer patient as healthy may be higher than vice versa). You can deal with this by e.g. adapting your cost function (a false negative incurring a higher cost than a false positive) or by performing some kind of threshold optimization after training (to e.g. reach a certain precision/recall in cross-validation).
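For illustration only (the 5:1 cost ratio, the synthetic data and the logistic regression are assumptions, not something from your problem), both ideas could be sketched like this:

```python
# A minimal sketch of cost-sensitive training plus threshold tuning;
# the 5:1 cost ratio and logistic regression are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# class_weight makes a false negative (missed positive) 5x as costly
# as a false positive during training.
clf = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000)
clf.fit(X_train, y_train)

# Threshold optimization on a validation set: pick the cutoff that reaches
# at least 0.8 recall while maximizing precision.
probs = clf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, probs)
ok = recall[:-1] >= 0.8
best = np.argmax(precision[:-1] * ok)
print("Chosen threshold:", thresholds[best])
```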
The problem of highly correlated features occurs with models that assume that there is no correlation between features. For example, if you have an issue with multicollinearity in your feature space, parameter estimates in logistic regression may be off. You can check for multicollinearity using, for example, the variance inflation factor (VIF). However, not all models carry such an assumption, so you might be safe disregarding the issue depending on your setting.
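To make the VIF check concrete, here is a minimal sketch with statsmodels (the toy DataFrame stands in for your numeric feature matrix):

```python
# A minimal sketch of a VIF check with statsmodels; df is a stand-in
# for your own numeric feature matrix.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = df["a"] * 0.9 + rng.normal(scale=0.1, size=200)  # highly correlated with a
df["c"] = rng.normal(size=200)

# Add an intercept, then compute the VIF of each feature.
# VIF > 5-10 is a common rule of thumb for problematic multicollinearity.
X = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vif)
```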
The same goes for standardisation: it may not be necessary (e.g. tree classifiers), but other methods may require it (e.g. PCA).
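As an example of the second case, PCA on unscaled features is dominated by whichever column has the largest variance, so a scaler is usually put in front of it (a sketch, with a synthetic dataset standing in for yours):

```python
# A minimal sketch: standardising before PCA, with a tree ensemble left unscaled.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Scale-sensitive: standardise, then project onto principal components.
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=5))
X_reduced = pca_pipeline.fit_transform(X)

# Scale-insensitive: a tree ensemble can consume the raw features directly.
rf = RandomForestClassifier(random_state=0).fit(X, y)
```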
Whether or not to handle outliers is a difficult question. First you would have to define what an outlier is - are they e.g. a result of human error? Do you expect to see similar instances out in the wild? If you can establish that your model performs better when trained with outliers removed (evaluated on a holdout validation or test set), then: sure, go for it. But keep potential outliers in for validation if you plan to apply your model to streams of data which may produce similar outliers.
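One way to establish that is to filter the training split only and score both variants on the same untouched validation set. A rough sketch (the IQR rule, the synthetic regression data and the ridge model are assumptions for illustration):

```python
# A minimal sketch: compare a model trained with and without IQR-flagged
# outliers, scored on the same untouched validation set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)
y[:20] += 500  # inject a few extreme targets to act as outliers
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# IQR rule on the target of the *training* split only.
q1, q3 = np.percentile(y_train, [25, 75])
iqr = q3 - q1
keep = (y_train > q1 - 1.5 * iqr) & (y_train < q3 + 1.5 * iqr)

with_outliers = Ridge().fit(X_train, y_train).score(X_val, y_val)
without_outliers = Ridge().fit(X_train[keep], y_train[keep]).score(X_val, y_val)
print(f"R^2 with outliers: {with_outliers:.3f}, without: {without_outliers:.3f}")
```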
Related
I have a case where I want to predict columns H1 and H2, which are continuous, from all-categorical features, in the hope of finding the combination of features that gives optimal results for H1 and H2. However, the distribution of the categories is uneven; some categories occur only once.
Here's my data:
And here is the category frequency information for each column:
What I want to ask:
Does the imbalance in the categorical features greatly affect the predictions? What is the right way to deal with this problem?
How do you find the optimal combination? Do you have to run a simulation that predicts every combination of features with the trained model?
What analytical technique is appropriate for determining the relationship between the features and H1 and H2? So far I have been converting the categorical data using one-hot encoding and then computing a correlation map.
What ML model can be applied to my case? So far I have tried RF, KNN, and SVR models, but the RMSE is still high.
What keywords describe similar cases and could help me search for articles on Google? This is my first time working on an ML/DS case for a paper.
Thank you very much!
A prediction based on a single observation won't be too reliable, of course. Binning rare categories into a sort of 'other' category is one common approach (see the sketch at the end of this answer).
Feature selection is a vast topic (search: filter methods, embedded methods, wrapper methods). Personally I prefer studying mutual information and the variance inflation factor first.
We cannot rely on Pearson's correlation when talking about categorical or binary features. The basic approach would be to group your dataset by category and compare the target distributions for each one, perhaps running statistical tests to check whether the differences are significant. Also search for ANOVA and the Kendall rank correlation.
That said, preprocessing your data to get rid of useless or redundant features often yields much more improvement than using more complex models or hyperparameter tuning. Regardless, trying out gradient boosting models never hurts (catboost even provides a robust automatic handling of categorical features). ExtraTreesRegressor is less prone to overfitting than classic RF. Linear models should not be ignored either, especially ones like Lasso with embedded feature selection capability.
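To make the 'other' bucket idea from the first point concrete, here is a minimal pandas sketch (the column name, toy values and the 2% frequency threshold are arbitrary assumptions):

```python
# A minimal sketch of binning rare categories into 'other'; the column name
# and the 2% frequency threshold are arbitrary assumptions.
import pandas as pd

df = pd.DataFrame({"material": ["steel"] * 50 + ["wood"] * 45 + ["glass"] * 4 + ["foam"]})

freq = df["material"].value_counts(normalize=True)
rare = freq[freq < 0.02].index            # categories below the threshold
df["material_binned"] = df["material"].where(~df["material"].isin(rare), "other")
print(df["material_binned"].value_counts())
```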
I'm trying to find ways to improve the performance of machine learning models, whether for binary classification, regression, or multinomial classification.
I'm now looking at categorical variables and trying to combine low-occurring levels together. Let's say a categorical variable has 10 levels, where 5 levels account for 85% of the total frequency count and the remaining 5 levels account for the remaining 15%.
I'm currently trying different thresholds (30%, 20%, 10%) for combining levels. This means I combine the levels which together represent either 30%, 20%, or 10% of the remaining counts.
I was wondering if grouping these "low frequency groups" into a new level called "others" would have any benefit in improving the performance.
I further use a random forest for feature selection, and I know that having fewer levels than originally may cause a loss of information and therefore not improve my performance.
Also, I tried discretizing numeric variables but noticed that my performance was weaker, because random forests benefit from having the ability to split at their preferred split point rather than being forced to split at an engineered split point created by discretizing.
In your experience, would grouping low-occurring levels together have a positive impact on performance? If yes, would you recommend any techniques?
Thank you for your help!
This isn't a programming question... By having fewer classes, you inherently increase the chance of randomly predicting the correct class.
Consider a stacked model (two models) where you have a primary model to classify between the overrepresented classes and the 'other' class, and then have a secondary model to classify between the classes within the 'other' class if the primary model predicts the 'other' class.
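A rough sketch of that two-stage idea (the synthetic dataset, the 5% rarity threshold and the choice of random forests are all assumptions for illustration):

```python
# A minimal sketch of the two-stage idea: a primary model that predicts
# frequent classes vs. 'other', and a secondary model that resolves the
# rare classes. Dataset, threshold, and classifiers are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=6,
                           weights=[0.4, 0.3, 0.22, 0.04, 0.02, 0.02],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Classes covering less than 5% of the training data go into the 'other' bucket.
counts = np.bincount(y_tr) / len(y_tr)
rare = list(np.where(counts < 0.05)[0])
y_tr_primary = np.where(np.isin(y_tr, rare), -1, y_tr)   # -1 marks 'other'

primary = RandomForestClassifier(random_state=0).fit(X_tr, y_tr_primary)

# Secondary model trained only on the rare-class rows.
mask = np.isin(y_tr, rare)
secondary = RandomForestClassifier(random_state=0).fit(X_tr[mask], y_tr[mask])

# At prediction time, route 'other' predictions to the secondary model.
pred = primary.predict(X_te)
route = pred == -1
if route.any():
    pred[route] = secondary.predict(X_te[route])
print("Accuracy:", (pred == y_te).mean())
```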
I have encountered a similar question, and this is what I have managed to find so far:
Try numerical feature encoding for high-cardinality categorical features.
There is a very good library for that: category_encoders. There is also a Medium article describing some of the methods (see the sketch after the sources below).
Sources
https://www.kaggle.com/general/16927
https://stats.stackexchange.com/questions/411767/encoding-of-categorical-variables-with-high-cardinality
https://datascience.stackexchange.com/questions/37819/ml-models-how-to-handle-categorical-feature-with-over-1000-unique-values?newreg=f16addec41734d30bb12a845a32fe132
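A minimal sketch of what that encoding might look like with category_encoders (the toy data and the choice of TargetEncoder are assumptions; the library offers many other encoders):

```python
# A minimal sketch using category_encoders; TargetEncoder replaces each
# category with a smoothed mean of the target. The toy data is an assumption.
import pandas as pd
import category_encoders as ce

train = pd.DataFrame({
    "city": ["london", "paris", "paris", "berlin", "london", "rome"],
    "y":    [1, 0, 0, 1, 1, 0],
})

encoder = ce.TargetEncoder(cols=["city"])
encoded = encoder.fit_transform(train[["city"]], train["y"])
print(encoded)
```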
Cut levels with low frequencies, as you proposed initially. The threshold is arbitrary and is usually 1-10% of the training samples. scikit-learn has an option for this in its OneHotEncoder transformer (see the sketch after the sources below).
Sources
https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b
https://www.kaggle.com/general/16927
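For the frequency cut-off, recent scikit-learn versions (1.1+, an assumption about your environment) expose this directly via OneHotEncoder's min_frequency parameter, which collapses rare levels into a single infrequent column:

```python
# A minimal sketch of scikit-learn's infrequent-category grouping
# (requires scikit-learn >= 1.1); the toy column is an assumption.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"]] * 50 + [["blue"]] * 45 + [["mauve"]] * 3 + [["teal"]] * 2)

# Levels seen fewer than 5 times are collapsed into one 'infrequent' bucket.
enc = OneHotEncoder(min_frequency=5, handle_unknown="infrequent_if_exist")
enc.fit(colors)
print(enc.get_feature_names_out())  # e.g. ['x0_blue', 'x0_red', 'x0_infrequent_sklearn']
```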
I'm training a BinaryClassifier on data that has 100 attributes, where the positive scenario occurs in only 3% of 800k items. During training, do we need to include the positive as well as the negative instances? I'm guessing that we shouldn't, as the outcome would only be binary, i.e. if the model is trained on positives, then a weak match would mean that it's negative.
If I do need to include both, would the pandas DataFrame's sample method be reliable?
Thank you!
If you're asking how to handle an imbalanced dataset, there are many blog posts online on that topic, e.g. here. One possible way to use pandas' sample method would be to set the weights parameter to the frequency of the other class, i.e. 0.97 for positive instances and 0.03 for negative ones, thereby correcting the imbalance by oversampling.
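A minimal sketch of that weighting idea with pandas (the column names and the toy frame are assumptions):

```python
# A minimal sketch of oversampling with DataFrame.sample; 'label' == True is
# the 3% positive class. Column names and the toy frame are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.normal(size=10000),
                   "label": rng.random(10000) < 0.03})

# Weight each row by the frequency of the *other* class, then resample
# with replacement; positives and negatives end up roughly balanced.
weights = np.where(df["label"], 0.97, 0.03)
balanced = df.sample(n=len(df), replace=True, weights=weights, random_state=0)
print(balanced["label"].mean())   # should be close to 0.5
```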
But if you're saying that you could theoretically fit a model to the distribution of the positive instances alone and, during testing, label all outliers as negative instances – that is possible, although not advisable. That approach would almost certainly perform worse than one that learns from both classes. Furthermore, standard binary classification algorithms, like those in scikit-learn, assume training instances from both classes.
If you are training a binary classifier, you will need examples of both classes in your training dataset.
At least if you want your classifier to work.
What you have is an imbalanced dataset; here are some ways to address this problem:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
I had trained my model using the KNN classification algorithm and was getting around 97% accuracy. However, I later noticed that I had forgotten to normalise my data, so I normalised the data and retrained the model; now I am getting an accuracy of only 87%. What could be the reason? Should I stick with the data that is not normalised, or should I switch to the normalised version?
To answer your question, you first need to understand how KNN works. Here is a simple diagram:
Suppose the ? is the point you are trying to classify as either red or blue. For this case, let's assume you haven't normalized any of the data. As you can clearly see, the ? is closer to more red dots than blue dots. Therefore, this point would be assumed to be red. Let's also assume the correct label is red, so this is a correct match!
Now, to discuss normalization. Normalization is a way of taking data that is slightly dissimilar and giving it a common scale (in your case, think of it as making the features more comparable). Assume in the above example that you normalize the ?'s features, and as a result its y value becomes smaller. This would place the question mark below its current position, surrounded by more blue dots. Therefore, your algorithm would label it as blue, and it would be incorrect. Ouch!
Now to answer your questions. Sorry, but there is no single answer! Sometimes normalizing data removes important feature differences, causing accuracy to go down. Other times, it helps to eliminate noise in your features which causes incorrect classifications. Also, just because accuracy goes up for the dataset you are currently working with doesn't mean you will get the same results with a different dataset.
Long story short, instead of trying to label normalization as good or bad, consider the feature inputs you are using for classification, determine which ones are important to your model, and make sure differences in those features are reflected accurately in your classification model. Best of luck!
That's a pretty good question, and the result is unexpected at first glance, because normalization usually helps a KNN classifier do better. Good KNN performance generally requires preprocessing the data so that all variables are similarly scaled and centered; otherwise KNN will often be inappropriately dominated by scaling factors.
In this case the opposite effect is seen: KNN gets WORSE with scaling, seemingly.
However, what you may be witnessing could be overfitting. The KNN may be overfit, which is to say it memorized the data very well, but does not work well at all on new data. The first model might have memorized more data due to some characteristic of that data, but it's not a good thing. You would need to check your prediction accuracy on a different set of data than what was trained on, a so-called validation set or test set.
Then you will know whether the KNN accuracy is OK or not.
Look into learning curve analysis in the context of machine learning. Please go learn about bias and variance. It's a deeper subject than can be detailed here. The best, cheapest, and fastest sources of instruction on this topic are videos on the web, by the following instructors:
Andrew Ng, in the online Coursera course Machine Learning
Tibshirani and Hastie, in the online Stanford course Statistical Learning.
If you use normalized feature vectors, the distances between your data points are likely to be different than when you used unnormalized features, particularly when the ranges of the features differ. Since kNN typically uses Euclidean distance to find the k nearest points to any given point, using normalized features may select a different set of k neighbors than the ones chosen when unnormalized features were used, hence the difference in accuracy.
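The cleanest way to settle it for your data is to cross-validate both variants; a minimal sketch (the built-in dataset and k=5 are assumptions):

```python
# A minimal sketch: cross-validate KNN with and without normalisation
# on the same data. The dataset and k=5 are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("raw:   ", cross_val_score(raw, X, y, cv=5).mean())
print("scaled:", cross_val_score(scaled, X, y, cv=5).mean())
```

Keeping the scaler inside the pipeline also ensures the normalisation statistics are computed on the training folds only, which makes the comparison honest and guards against the overfitting concern raised above.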
I'm using scikit-learn to perform cross-validation with StratifiedKFold to compute the f1 score, but it warns that the sum of true positives and false positives is equal to zero for some labels. I thought using StratifiedKFold should prevent this? Why am I getting this problem?
Also, is there a way to get the confusion matrix from the cross_val_score function?
Your classifier is probably classifying all data points as negative, so there are no predicted positives. You can check whether that is the case by looking at the confusion matrix (docs and example here). It's hard to tell what is happening without information about your data and choice of classifier, but common causes include:
a bug in your code. Check that your training data contains negative data points, and that these data points contain non-zero features.
inappropriate classifier parameters. If using Naive Bayes, check your class biases. If using SVM, try using grid search over parameter values.
The sklearn classification_report function may come in handy (docs).
Regarding stratification: it ensures that each fold contains roughly the same proportion of data points from all classes, but this does not mean your classifier will perform sensibly.
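Regarding the confusion matrix: cross_val_score only returns scores, but cross_val_predict gives out-of-fold predictions from which you can build one (a sketch; the dataset and classifier are placeholders):

```python
# A minimal sketch: a single confusion matrix from out-of-fold predictions.
# The dataset and classifier are placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))
```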
Update:
In a classification task (and especially when class imbalance is present) you are trading off precision for recall. Depending on your application, you can set your classifier so it does well most of the time (i.e. high accuracy) or so that it can detect the few points that you care about (i.e. high recall of the smaller classes). For example, if the task is to forward support emails to the right department, you want high accuracy. It is somewhat acceptable to misclassify the kind of email you get once a year, because you only upset one person. If your task is to detect posts by sexual predators on a children's forum, you definitely do not want to miss any of them, even if the price is that a few posts will get incorrectly flagged. Bottom line: you should optimise for your application.
Are you micro or macro averaging recall? In the former case, more weight will be given to the frequent classes (which is similar to optimising for accuracy), and in the latter all classes will have the same weight.
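To see the difference concretely, here is a small sketch computing both averages on a made-up imbalanced prediction vector:

```python
# A minimal sketch of micro vs. macro recall on made-up labels: the rare
# class 'b' is always missed, which macro averaging penalises heavily.
from sklearn.metrics import recall_score

y_true = ["a"] * 90 + ["b"] * 10
y_pred = ["a"] * 100              # every instance predicted as the frequent class

print("micro:", recall_score(y_true, y_pred, average="micro"))  # 0.90
print("macro:", recall_score(y_true, y_pred, average="macro"))  # 0.50
```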