I'm performing some (binary) text classification with two different classifiers on the same unbalanced data. I want to compare the results of the two classifiers.
When using sklearn's logistic regression, I have the option of setting class_weight = 'balanced'. For sklearn naive Bayes, there is no such parameter available.
I know that I can just randomly sample from the bigger class in order to end up with equal sizes for both classes, but then data is lost.
Why is there no such parameter for naive Bayes? I guess it has something to do with the nature of the algorithm, but I can't find anything about this specific matter. I would also like to know what the equivalent would be: how can I achieve a similar effect, so that the classifier is aware of the imbalanced data and gives more weight to the minority class and less to the majority class?
I'm writing this partially in response to the other answer here.
Logistic regression and naive Bayes are both linear models that produce linear decision boundaries.
Logistic regression is the discriminative counterpart to naive Bayes (a generative model). You decode each model to find the best label according to p(label | data). What sets Naive Bayes apart is that it does this via Bayes' rule: p(label | data) ∝ p(data | label) * p(label).
(The other answer is right to say that the Naive Bayes features are independent of each other (given the class), by the Naive Bayes assumption. With collinear features, this can sometimes lead to bad probability estimates for Naive Bayes—though the classification is still quite good.)
This factoring is why Naive Bayes handles class imbalance so well: it's keeping separate books for each class. There's a parameter for each (feature, label) pair. This means that the super-common class can't mess up the super-rare class, and vice versa.
There is one place that the imbalance might seep in: the p(labels) distribution. It's going to match the empirical distribution in your training set: if it's 90% label A, then p(A) will be 0.9.
If you think that the training distribution of labels isn't representative of the testing distribution, you can manually alter the p(labels) values to match your prior belief about how frequent label A or label B, etc., will be in the wild.
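In scikit-learn this corresponds to the class_prior (and fit_prior) arguments of the naive Bayes estimators. A minimal sketch, assuming a small toy corpus (the strings and labels below are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

X_text = ["cheap pills now", "meeting at noon", "free money", "see you tomorrow"]
y = [1, 0, 1, 0]
X = CountVectorizer().fit_transform(X_text)

# default: p(label) is estimated from the (possibly imbalanced) training labels
nb_default = MultinomialNB().fit(X, y)

# override p(label) with your own belief about class frequencies in the wild
nb_custom = MultinomialNB(class_prior=[0.5, 0.5]).fit(X, y)

print(nb_default.class_log_prior_, nb_custom.class_log_prior_)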
Logistic Regression is a linear model, i.e. it draws a straight line through your data, and the class of a datum is determined by which side of the line it's on. This line is just a linear combination (a weighted sum) of your features, so we can adjust for imbalanced data by adjusting the weights.
Naïve Bayes, on the other hand, works by calculating the conditional probability of each feature given the label, then uses the Naïve Bayes assumption (features are independent given the label) to calculate the probability of a datum having a particular label (by multiplying the conditional probabilities of the features and scaling). There is no obvious parameter to adjust to account for imbalanced classes.
Instead of undersampling, you could try oversampling: expanding the smaller class with duplicates or slightly adjusted data. You could also look into other approaches based on your problem domain (since you're doing text classification, these answers have some suggested approaches).
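A minimal oversampling sketch using sklearn.utils.resample, assuming X is a NumPy feature matrix and y a NumPy label array in which class 1 is the minority:

import numpy as np
from sklearn.utils import resample

minority = (y == 1)
X_min, y_min = X[minority], y[minority]
X_maj, y_maj = X[~minority], y[~minority]

# sample the minority class with replacement until it matches the majority size
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate([y_maj, y_min_up])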
Related
I am a beginner in machine learning in Python, and I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%. I have tried numerous ways to improve the accuracy of the model, such as one-hot encoding of categorical variables and scaling of the continuous variables, and I did a grid search to find the best parameters. They all failed to improve the accuracy, so I looked into unsupervised learning methods in order to improve it.
I tried using KMeans clustering and set n_clusters to 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation, but I set the cross-validation to score against the labels predicted by KMeans:
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
kmeans = KMeans(n_clusters=2)
kmeans.fit(X_train)
logreg = LogisticRegression().fit(X_train, y_train)
cross_val_score(logreg, X_train, kmeans.labels_, cv=5)  # scored against cluster labels, not y_train
When using the cross_val_score, the accuracy is averaging over 95%. However, when I use the .score() method:
logreg.score(X_train, kmeans.labels_)
the score is in the 60s. My questions are:
What is the significance (or meaning) of the score produced when testing the model against the labels predicted by k-means?
How can I use k-means clustering to improve the accuracy of the model? I tried adding a 'cluster' column that contains the clustering labels to the training data and fitting the logistic regression on it, but that didn't improve the score either.
Why is there a huge discrepancy between the score when evaluated via cross_val_score and the .score() method?
I'm having a hard time understanding the context of your problem based on the snippet you provided. Strong work providing minimal code, but in this case I feel it may have been a bit too minimal. Regardless, I'm going to read between the lines and state some relevant ideas. I'll then attempt to answer your questions more directly.
I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%
This only tells a small part of the story. Knowing what data you're classifying and its general form is pretty vital, and accuracy doesn't tell us a lot about how the inaccuracy is distributed through the problem.
Some natural questions:
Is one class 50% accurate and the other 100% accurate, or are both classes 75% accurate?
What is the class balance (is there more of one class than the other)?
How much overlap do these classes have?
I recommend profiling your training and testing set, and maybe running your data through TSNE to get an idea of class overlap in your vector space.
These plots will give you an idea of how much overlap your two classes have. In essence, TSNE maps a high-dimensional X to a 2-D X while attempting to preserve proximity. You can then plot your flagged Y values as colors and the 2-D X values as points on a grid to get an idea of how tightly packed your classes are in high-dimensional space. In an easy classification problem, each class sits on its own island; the more these islands mix together, the harder classification will be.
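A minimal plotting sketch, assuming X_train is a dense numeric array and y_train holds the class labels (matplotlib is used for the scatter plot):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# project the high-dimensional features down to 2-D while preserving local structure
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_train)

# color each point by its class label to see how much the classes overlap
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, cmap='coolwarm', s=10)
plt.title('t-SNE projection of the training set')
plt.show()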
did a grid search to find the best parameters
Hot take, but don't use grid search; random search is better (source: Artificial Intelligence by Jones and Barlett). Grid search repeats too much information, wasting time re-exploring similar parameters.
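In scikit-learn, random search is available as RandomizedSearchCV. A minimal sketch for logistic regression, with a made-up range for the regularization strength C (and assuming scipy is installed for the log-uniform distribution):

from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# sample C from a log-uniform range instead of stepping through a fixed grid
param_dist = {'C': loguniform(1e-3, 1e3)}
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                            n_iter=20, cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)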
I tried using KMeans clustering and set n_clusters to 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation, but I set the cross-validation to score against the labels predicted by KMeans:
So, to rephrase, you trained your model to predict an output given some input, then tested how it performed predicting the same data and got 75%. This is called training accuracy (as opposed to validation or test accuracy). A low training accuracy is indicative of one of two things:
there's a lot of overlap between your classes. If this is the case, I would look into feature engineering. Find a vector space which better segregates the two classes.
there's not a lot of overlap, but the boundary between the two classes is complex. You need a model with more parameters to segregate your two classes.
Model complexity isn't free, though; see the curse of dimensionality and overfitting.
OK, answering your questions more directly:
These accuracy scores mean your model isn't complex enough to learn the problem, or there's too much overlap between the two classes to see a better accuracy.
I wouldn't use k-means clustering to try to improve this. k-means attempts to find cluster information based on location in a vector space, but you already have flagged data (y_train), so you already know which clusters the data should belong to. Try modifying X_train in some way to get better segregation, or try a more complex model. You can use things like k-means or TSNE to check your transformed X_train for better segregation, but I wouldn't use them directly. Obligatory reminder that you need to test and validate with holdout data; see another answer I provided for more info.
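A minimal holdout sketch, assuming X and y hold the full features and labels; the gap between training and test accuracy is what tells you whether the model is too simple or overfitting:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hold out 20% of the data, stratified so both splits keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('train accuracy:', model.score(X_train, y_train))
print('test accuracy: ', model.score(X_test, y_test))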
I'd need more code to figure that one out.
P.S. Welcome to Stack Overflow! Keep at it.
I am trying to create a binary classification model for an imbalanced dataset using Random Forest (class 0: 84K samples, class 1: 16K samples). I have tried class_weight = 'balanced', class_weight = {0: 1, 1: 5}, downsampling, and oversampling, but none of these seem to work. My metrics are usually in the below range:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks
There are lots of ways to improve classifier behavior. If you think your data are balanced (or rather, that your weighting method balances them enough), then consider expanding your forest, either with deeper trees or with more numerous trees.
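A minimal sketch of expanding the forest while keeping the class weighting; the values for n_estimators and max_depth are made up and worth tuning:

from sklearn.ensemble import RandomForestClassifier

# more trees and deeper trees give the forest more capacity,
# while class_weight='balanced' keeps re-weighting the minority class
clf = RandomForestClassifier(n_estimators=500, max_depth=20,
                             class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)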
Try other methods like SVM, or ANN, and see how they compare.
Try stratified sampling for the dataset, so that a constant class ratio is maintained in both the training and the test set, and then use class_weight='balanced' as you already have. If you want the accuracy improved, there are tons of other ways:
1) First, be sure that the dataset being provided is accurate and verified.
2) You can increase the accuracy by playing with the probability threshold: in binary classification, make a prediction only if the model is, say, more than 0.7 confident, and otherwise don't. The drawback of this approach is null values, i.e. mostly not predicting when the algorithm is not confident enough, but for a business model it is a good approach because people prefer fewer false negatives (see the sketch after this list).
3) Use stratified sampling to divide the training and the test set so that the class ratio stays constant, rather than a plain train_test_split; stratified sampling will return you the indices for training and testing, and you can combine it with cross-validation over different iterations (also shown in the sketch after this list).
4) In the confusion matrix, have a look at the precision score per class and see which class dominates (I believe applying the threshold limitation above would solve the problem here).
5) Try other classifiers: logistic regression, SVM (linear or with another kernel, i.e. LinearSVC or SVC), or Naive Bayes. In most cases of binary classification, logistic regression and SVC seem to perform ahead of other algorithms, so try these approaches first.
6) Make sure to check for the best fitting parameters, such as the choice of hyperparameters (using grid search with a couple of learning rates, different kernels, class weights, or other parameters). If it's text classification, are you applying CountVectorizer with TF-IDF (and have you played with max_df and stop-word removal)?
If you have tried all of these, then take a second look at whether the algorithm itself is the right choice.
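A minimal sketch of points 2) and 3), assuming X and y hold the features and the imbalanced labels; the 0.7 threshold and the Random Forest are just placeholders:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stratify=y keeps the 84K/16K class ratio identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)

# predict class 1 only when the model is at least 70% confident;
# everything else falls back to class 0 (or could be left unpredicted, as in point 2)
proba = clf.predict_proba(X_test)[:, 1]
y_pred = np.where(proba >= 0.7, 1, 0)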
I am currently working on text classification, trying to classify emails.
After my research, I found that Multinomial Naive Bayes and Bernoulli Naive Bayes are most often used for text classification.
Bernoulli just cares about whether a word occurs or not.
Multinomial cares about the number of occurrences of a word.
Gaussian Naive Bayes is usually used for continuous, normally distributed data, e.g. height or weight.
But what is the reason that we don't use Gaussian Naive Bayes for text classification?
Will anything bad happen if we apply it to text classification?
We choose the algorithm based on the kind of dataset we have:
Bernoulli Naive Bayes is good at handling boolean/binary attributes, Multinomial Naive Bayes is good at handling discrete counts, and Gaussian Naive Bayes is good at handling continuous values.
Consider three scenarios:
Consider a dataset with columns like has_diabetes, has_bp, has_thyroid, from which you classify a person as healthy or not. In such a scenario Bernoulli NB will work well.
Consider a dataset with the marks of various students in various subjects, from which you want to predict whether a student is clever or not. In this case Multinomial NB will work fine.
Consider a dataset with the weights of students, from which you are predicting their height; GaussianNB will work well in this case.
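To see concretely why GaussianNB is a poor fit for text, here is a minimal sketch comparing the three variants on bag-of-words features; the strings and labels are made up for illustration, and note that GaussianNB needs a dense array while the other two work directly on sparse counts:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

docs = ["free money now", "meeting agenda attached", "win a free prize", "see report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy labels)

X = CountVectorizer().fit_transform(docs)    # sparse word counts

bnb = BernoulliNB().fit(X, labels)           # models word presence/absence
mnb = MultinomialNB().fit(X, labels)         # models word counts
gnb = GaussianNB().fit(X.toarray(), labels)  # assumes each count is normally distributed, which sparse counts are not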
Bayes classifiers use probabilistic rules; the three you have mentioned relate to the following:
Bayesian Probability: https://en.wikipedia.org/wiki/Bayesian_probability
Gaussian Distribution: https://en.wikipedia.org/wiki/Normal_distribution
Bernoulli Distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
Multinomial Distribution: https://en.wikipedia.org/wiki/Multinomial_distribution
You have to select the probability distribution to use based on the data you have (or try them all).
I think that what you have read on websites or in research papers relates to the fact that email data usually follow a Bernoulli or multinomial distribution. You can, and I encourage you to, try the Gaussian distribution; you should figure out very quickly whether your data can be fitted by a Gaussian distribution.
However, I would advise that you read the links above; you will have a better understanding of your work if you have a feeling for why solution A or B works better than solution C.
Given a classification problem, sometimes we do not just want to predict a class, but also need to return the probability of each class,
i.e. P(y=0|x), P(y=1|x), P(y=2|x), ..., P(y=C|x)
I'd like to do this without building a separate classifier to predict y=0, y=1, y=2, ..., y=C respectively, since training C classifiers (let's say C=100) can be quite slow.
What can be done to achieve this? Which classifiers can naturally give all the probabilities easily? One I know of is a neural network with 100 output nodes. But if I use traditional random forests, I can't do that, right? I use the Python scikit-learn library.
If you want probabilities, look for sklearn classifiers that have the predict_proba() method.
Sklearn documentation about multiclass: http://scikit-learn.org/stable/modules/multiclass.html
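A minimal sketch, assuming a multiclass feature matrix X and integer labels y; random forests expose predict_proba out of the box:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# one row per sample, one column per class, each row summing to 1
proba = clf.predict_proba(X)
print(clf.classes_)  # the column order of proba
print(proba[:5])     # probabilities for the first five samples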
All scikit-learn classifiers are capable of multiclass classification. So you don't need to build 100 models yourself.
Below is a summary of the classifiers supported by scikit-learn grouped by strategy:
Inherently multiclass: Naive Bayes, LDA and QDA, Decision Trees, Random Forests, Nearest Neighbors, setting multi_class='multinomial' in sklearn.linear_model.LogisticRegression.
Support multilabel: Decision Trees, Random Forests, Nearest Neighbors, Ridge Regression.
One-Vs-One: sklearn.svm.SVC.
One-Vs-All: all linear models except sklearn.svm.SVC.
Random forests do indeed give P(y|x) for multiple classes. In most cases P(y|x) can be taken as:
P(y|x) = (number of trees that vote for the class) / (total number of trees).
However, you can play around with this. For example, if in one case the highest class has 260 votes, the 2nd class has 230 votes, and the other 5 classes have 10 votes each, while in another case class 1 has 260 votes and the other classes have 40 votes each, you might feel more confident in your prediction in the second case compared to the first, so you can come up with a confidence metric according to your use case.
I am trying to build my own PMML exporter for a Naive Bayes model that I have built in scikit-learn. Reading the PMML documentation, it seems that for each feature vector you can either output the model in terms of count data if it is discrete, or as a Gaussian/Poisson distribution if it is continuous. But the coefficients of my scikit-learn model are the empirical log probabilities of the features given the class, i.e. p(x_i|y). Is it possible to specify the Bayes input parameters in terms of these probabilities rather than counts?
Since the PMML representation of the Naive Bayes model represents joint probabilities via the "PairCounts" element, one can simply replace those counts with the probability outputs (not the log probabilities). Since the final probabilities are normalized, the difference doesn't matter. If the requirements involve a large number of probabilities which are mostly 0, the "threshold" attribute of the model can be used to set the default value for such probabilities.