Sorry for some grammatical mistakes and misuse of words.
I am currently working on text classification, trying to classify emails.
From my research, I found that Multinomial Naive Bayes and Bernoulli Naive Bayes are the variants most often used for text classification.
Bernoulli only cares about whether a word occurs or not; Multinomial cares about the number of occurrences of the word.
Gaussian Naive Bayes is usually used for continuous, normally distributed data, e.g. height or weight.
But what is the reason we don't use Gaussian Naive Bayes for text classification?
Does anything bad happen if we apply it to text classification?
We choose the algorithm based on the kind of dataset we have:
Bernoulli Naive Bayes is good at handling boolean/binary attributes, Multinomial Naive Bayes is good at handling discrete values, and Gaussian Naive Bayes is good at handling continuous values.
Consider three scenarios:
Consider a dataset with columns like has_diabetes, has_bp, has_thyroid, from which you classify a person as healthy or not. In such a scenario Bernoulli NB will work well.
Consider a dataset with the marks of various students in various subjects, from which you want to predict whether a student is clever or not. In this case Multinomial NB will work fine.
Consider a dataset with the weights of students, from which you are predicting their heights. Gaussian NB will work well in this case (a short sketch of all three scenarios follows).
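A minimal sketch of the three scenarios, with toy data made up purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Scenario 1: binary attributes (has_diabetes, has_bp, has_thyroid) -> healthy or not
X_bin = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]])
y_bin = np.array([0, 1, 0, 1])  # 1 = healthy
BernoulliNB().fit(X_bin, y_bin)

# Scenario 2: discrete marks in three subjects -> clever or not
X_marks = np.array([[90, 85, 95], [40, 55, 35], [88, 92, 80], [50, 45, 60]])
y_marks = np.array([1, 0, 1, 0])  # 1 = clever
MultinomialNB().fit(X_marks, y_marks)

# Scenario 3: continuous weights -> a class label (e.g. tall / not tall)
X_weight = np.array([[62.5], [80.1], [55.3], [90.7]])
y_tall = np.array([0, 1, 0, 1])
GaussianNB().fit(X_weight, y_tall)
```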
Bayes classifiers use probabilistic rules; the ones you have mentioned relate to the following:
Bayesian Probability: https://en.wikipedia.org/wiki/Bayesian_probability
Gaussian Distribution: https://en.wikipedia.org/wiki/Normal_distribution
Bernoulli Distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
Multinomial Distribution: https://en.wikipedia.org/wiki/Multinomial_distribution
You have to select the probability distribution based on the data you have (or try them all).
I think what you have read on websites or in research papers relates to the fact that email data usually follow a Bernoulli or multinomial distribution. You can, and I encourage you to, try the Gaussian distribution; you should figure out very quickly whether your data can be fitted by a Gaussian distribution.
However, I would advise that you read the links above: you will have a better understanding of your work if you have a feel for why solution A or B works better than solution C.
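If you want to run that quick experiment, here is a minimal sketch comparing MultinomialNB and GaussianNB on a tiny made-up corpus, assuming scikit-learn (the documents and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB

docs = ["free money now", "meeting at noon", "win free prize", "project status update"]
labels = [1, 0, 1, 0]  # 1 = spam

X = CountVectorizer().fit_transform(docs)
MultinomialNB().fit(X, labels)         # works directly on sparse word counts
GaussianNB().fit(X.toarray(), labels)  # GaussianNB needs a dense array, and
                                       # fitting Gaussians to sparse word
                                       # counts rarely makes sense
```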
Related
I'm performing some (binary) text classification with two different classifiers on the same imbalanced data. I want to compare the results of the two classifiers.
When using sklearn's logistic regression, I have the option of setting class_weight='balanced'. For sklearn's naive Bayes there is no such parameter available.
I know that I can just randomly sample from the bigger class in order to end up with equal sizes for both classes, but then data is lost.
Why is there no such parameter for naive Bayes? I guess it has something to do with the nature of the algorithm, but I can't find anything about this specific matter. I would also like to know what the equivalent would be: how can I achieve a similar effect (so that the classifier is aware of the imbalanced data and gives more weight to the minority class and less to the majority class)?
I'm writing this partially in response to the other answer here.
Logistic regression and naive Bayes are both linear models that produce linear decision boundaries.
Logistic regression is the discriminative counterpart to naive Bayes (a generative model). You decode each model to find the best label according to p(label | data). What sets Naive Bayes apart is that it does this via Bayes' rule: p(label | data) ∝ p(data | label) * p(label).
(The other answer is right to say that the Naive Bayes features are independent of each other (given the class), by the Naive Bayes assumption. With collinear features, this can sometimes lead to bad probability estimates for Naive Bayes—though the classification is still quite good.)
The factoring here is how Naive Bayes handles class imbalance so well: it's keeping separate books for each class. There's a parameter for each (feature, label) pair. This means that the super-common class can't mess up the super-rare class, and vice versa.
There is one place that the imbalance might seep in: the p(labels) distribution. It's going to match the empirical distribution in your training set: if it's 90% label A, then p(A) will be 0.9.
If you think that the training distribution of labels isn't representative of the testing distribution, you can manually alter the p(labels) values to match your prior belief about how frequent label A or label B, etc., will be in the wild.
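For example, scikit-learn exposes exactly this override as the class_prior parameter of MultinomialNB and BernoulliNB (GaussianNB calls it priors); a minimal sketch with made-up count data:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up data: 9 examples of class 0, 1 example of class 1.
X = np.array([[2, 1]] * 9 + [[0, 3]])
y = np.array([0] * 9 + [1])

# By default p(labels) is learned from the data, so p(0)=0.9, p(1)=0.1.
# Override with a uniform prior if you believe classes are balanced in the wild.
clf = MultinomialNB(class_prior=[0.5, 0.5])
clf.fit(X, y)
print(np.exp(clf.class_log_prior_))  # -> [0.5, 0.5]
```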
Logistic regression is a linear model, i.e. it draws a straight line through your data, and the class of a datum is determined by which side of the line it is on. This line is just a linear combination (a weighted sum) of your features, so we can adjust for imbalanced data by adjusting the weights.
Naive Bayes, on the other hand, works by estimating the conditional probability of each feature given a label, then using the Naive Bayes assumption (features are conditionally independent given the class) to compute the probability of a datum having a particular label (by multiplying the per-feature conditional probabilities and applying Bayes' rule). There is no obvious parameter to adjust to account for imbalanced classes.
Instead of undersampling, you could try oversampling: expanding the smaller class with duplicates or slightly adjusted data. Or look into other approaches based on your problem domain (since you're doing text classification, these answers have some suggested approaches).
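As a concrete sketch of the duplicate-based oversampling idea, assuming scikit-learn's resample utility and made-up data:

```python
import numpy as np
from sklearn.utils import resample

# Made-up imbalanced data: 8 majority (0) and 2 minority (1) examples.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_min, y_min = X[y == 1], y[y == 1]
# Duplicate minority examples (sampling with replacement) up to the majority size.
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=8, random_state=0)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
```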
I am trying to classify a small dataset (around 10000 records) into two classes. I used various methods such as DT, Naive Bayes, and a k-NN classifier. Now I would like to set the results from one of the classifiers as my baseline and perform statistical hypothesis testing. I am not very familiar with statistical testing, and I wonder how to proceed.
I have been thinking of setting the DT classifier as my baseline, but I am not sure how to perform a t-test (or similar) on the data. The input dataset has 192 attributes. Should I use the classification results from the two classifiers and do a paired t-test on them? For example, I could take the results from Naive Bayes and perform a paired t-test against the DT results (the baseline). Is this the right approach?
Also, I am confused about the null and alternative hypotheses. Could someone give me an idea of how to set up the null and alternative hypotheses?
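For what it's worth, here is a sketch of the paired-test approach the question describes: score both classifiers on the same cross-validation folds and run a paired t-test on the per-fold accuracies. The dataset, models, and fold count below are illustrative stand-ins:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the ~10000-record, 192-attribute dataset.
X, y = make_classification(n_samples=1000, n_features=192, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both models
acc_dt = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
acc_nb = cross_val_score(GaussianNB(), X, y, cv=cv)

# H0: the mean per-fold accuracy of NB equals that of the DT baseline.
# H1: the mean per-fold accuracies differ.
t, p = ttest_rel(acc_nb, acc_dt)
print(t, p)
```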
I am trying to build my own PMML exporter for a Naive Bayes model that I have built in scikit-learn. Reading the PMML documentation, it seems that for each feature you can output the model either in terms of count data if it is discrete, or as a Gaussian/Poisson distribution if it is continuous. But the coefficients of my scikit-learn model are in terms of empirical log probabilities of features, i.e. p(x_i|y). Is it possible to specify the Bayes input parameters in terms of these probabilities rather than counts?
Since the PMML representation of the Naive Bayes model represents joint probabilities via the "PairCounts" element, you can simply replace those counts with the output probabilities (not the log probabilities). Since the final probabilities are normalized, the difference doesn't matter. If your model involves a large number of probabilities that are mostly 0, the "threshold" attribute of the model can be used to set a default value for such probabilities.
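As a sketch of recovering plain probabilities from a fitted scikit-learn model (the toy counts are made up), feature_log_prob_ just needs to be exponentiated:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up count data, just to get a fitted model.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 1]])
y = np.array([0, 0, 1, 1])
clf = MultinomialNB().fit(X, y)

# scikit-learn stores log probabilities; exponentiate to get the plain
# probabilities to place in the PMML "PairCounts" entries.
probs = np.exp(clf.feature_log_prob_)  # shape: (n_classes, n_features)
```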
Is there anything in scikit-learn that can help me with the following?
I need a Bayesian network that is capable of taking continuous valued inputs and training against continuous valued targets. I then want to feed in new, previously unseen continuous inputs and receive estimates of the target values. Preferably with a way to measure confidence of the predictions. (PDFs perhaps?)
I am uncertain whether this would be considered a Naive Bayes Classifier or not.
I keep looking at GaussianNB but I just cannot see how it could be used in this way.
I'd like one that supports "independence of irrelevant alternatives".
Any advice is greatly appreciated.
You are talking about regression, not classification; a Naive Bayes classifier is not a regression model. Check out scikit-learn's numerous regressors. In particular, you could be interested in Bayesian Ridge Regression.
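For instance, BayesianRidge can return a per-prediction standard deviation as a confidence measure; a minimal sketch on made-up continuous data:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Made-up continuous inputs and targets.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=100)

reg = BayesianRidge().fit(X, y)

# return_std=True gives a per-prediction standard deviation, which can serve
# as the confidence measure the question asks for.
mean, std = reg.predict(rng.rand(5, 3), return_std=True)
print(mean, std)
```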
In NLTK, if I write a NaiveBayes classifier for, say, movie reviews (determining whether they are positive or negative), how can I determine the classifier's "certainty" when classifying a particular review? That is, I know how to run an accuracy test on a given test set to see the general accuracy of the classifier. But is there any way to have NLTK output its certainty? (Perhaps on the basis of the most informative features...)
Thanks
I am not sure about the NLTK implementation of Naive Bayes, but the Naive Bayes algorithm outputs probabilities of class membership. However, they are horribly calibrated.
If you want good measures of certainty, you should use a different classification algorithm. Logistic regression will do a decent job at producing calibrated estimates.
Have a look at nltk.classify.util.log_likelihood. For this problem, you can also try measuring the results by precision, recall, and F-score at the token level, that is, scores for positive and negative respectively.
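For the per-review certainty the question asks about, NLTK's NaiveBayesClassifier does expose per-label probabilities via prob_classify (keeping in mind the calibration caveat above); a minimal sketch with made-up feature sets:

```python
from nltk.classify import NaiveBayesClassifier

# Made-up boolean word features for a few reviews.
train = [
    ({"great": True, "awful": False}, "pos"),
    ({"great": False, "awful": True}, "neg"),
    ({"great": True, "boring": False}, "pos"),
    ({"boring": True, "great": False}, "neg"),
]
classifier = NaiveBayesClassifier.train(train)

dist = classifier.prob_classify({"great": True, "boring": True})
print(dist.max())             # most likely label
for label in dist.samples():  # per-label "certainty"
    print(label, dist.prob(label))
```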