I am trying to build my own pmml exporter for Naive Bayes model that I have built in scikit learn. In reading the PMML documentation it seems that for each feature vector you can either output the model in terms of count data if it is discrete or as a Gaussian/Poisson distribution if it is continous. But the coefficients of my scikit learn model are in terms of Empirical log probability of features i.e p(y|x_i). Is it possible to specify the Bayes input parameters in terms of these probability rather than counts?
Since the PMML representation of the Naive Bayes model implements representing joint probabilities via the "PairCounts" element, one can simply replace that ratio with the probabilities output (not the log probability). Since the final probabilities are normalized, the difference doesn't matter. If the requirements involve a large number of proabilities which are mostly 0, the "threshold" attribute of the model can be used to set the default values for such probabilities.
Related
I'm performing some (binary)text classification with two different classifiers on the same unbalanced data. i want to compare the results of the two classifiers.
When using sklearns logistic regression, I have the option of setting the class_weight = 'balanced' for sklearn naive bayes, there is no such parameter available.
I know, that I can just randomly sample from the bigger class in order to end up with equal sizes for both classes, but then the data is lost.
Why is there no such parameter for naive bayes? I guess it has something to do with the nature of the algorithm, but cant find anything about this specific matter. I also would like to know what the equivalent would be? How to achieve a similar effect (that the classifier is aware of the imbalanced data and gives more weight to the minority class and less to the majority class)?
I'm writing this partially in response to the other answer here.
Logistic regression and naive Bayes are both linear models that produce linear decision boundaries.
Logistic regression is the discriminative counterpart to naive Bayes (a generative model). You decode each model to find the best label according to p(label | data). What sets Naive Bayes apart is that it does this via Bayes' rule: p(label | data) ∝ p(data | label) * p(label).
(The other answer is right to say that the Naive Bayes features are independent of each other (given the class), by the Naive Bayes assumption. With collinear features, this can sometimes lead to bad probability estimates for Naive Bayes—though the classification is still quite good.)
The factoring here is how Naive Bayes handles class imbalance so well: it's keeping separate books for each class. There's a parameter for each (feature, label) pair. This means that the super-common class can't mess up the super-rare class, and vice versa.
There is one place that the imbalance might seep in: the p(labels) distribution. It's going to match the empirical distribution in your training set: if it's 90% label A, then p(A) will be 0.9.
If you think that the training distribution of labels isn't representative of the testing distribution, you can manually alter the p(labels) values to match your prior belief about how frequent label A or label B, etc., will be in the wild.
Logistic Regression is a linear model, ie it draws a straight line through your data and the class of a datum is determined by which side of the line it's on. This line is just a linear combination (a weighted sum) of your features, so we can adjust for imbalanced data by adjusting the weights.
Naïve Bayes, on the other hand, works by calculating the conditional probability of labels given individual features, then uses the Naïve Bayes assumption(features are independent) to calculate the probability of a datum having a particular label (by multiplying the conditional probability of each feature and scaling). There is no obvious parameter to adjust to account for imbalanced classes.
Instead of undersampling, you could try oversampling - expanding the smaller class with duplicates or slightly adjusted data or look into other approaches based on your problem domain (since you're doing text classification, these answers have some suggested approaches).
I have an imbalanced binary classification problem. The ratio for positive to negative class is about 1:10. I trained an XGBoost tree model to predict these two classes using continuous and categorical data as input.
Using this XGBoost library, I predict the probability of new inputs using predict_proba. I am assuming the probability values output here is the likelihood of these new test data being the positive class? Say I have an entire test set with test labels, how am I able to evaluate the quality of these (I'm assuming) likelihoods?
I am looking for a Machine learning approach to find the most likely class lable (with the probability value) for a given feature vector. I have a training set for n classes and most of the feature vector consist of boolean values. Till now I was thinking of counting the number of True values for features and normalizing ( for eg m= number of training samples with value True for a feature and n =number of training samples. feat_val=m/n) it to create a representational feature vector for a class. Once created, similarity measures like cosine distance or eucledian distance between the class representation vector and the given feature vector.
Can anyone suggest whether this approach will be worth implementing?
The problem you are trying to solve is called classification and is a major part of supervised learning. Great place to start is an open source library called scikit-learn and their documentation (try this).
There are a lot of classification models available but once you pick a specific one and train it then you simply use the predict_proba method to get probabilities for a given feature vector or matrix.
Sorry for some grammatical mistakes and misuse of words.
I am currently working with text classification, trying to classify the email.
After my research, i found out Multinomial Naive Bayes and Bernoulli Naive Bayes is more often used for text classification.
Bernoulli just cares about whether the word happens or not.
Multinomial cares about the number of occurrence of the word.
For Gaussian Naive Bayes, it's usually been used for continuous data and data with normal distribution, eg: height,weight
But what is the reason that we don't use Gaussian Naive Bayes for text classification?
Any bad things will happen if we apply it to text classification?
We use algorithm based on the kind of dataset we have -
Bernoulli Naive bayes is good at handling boolean/binary attributes, while Multinomial Naive bayes is good at handling discrete values and Gaussian naive bayes is good at handling continuous values.
Consider three scenarios:
Consider a dataset which has columns like has_diabetes, has_bp, has_thyroid and then you classify the person as healthy or not. In such a scenario Bernoulli NB will work well.
Consider a dataset that has marks of various students of various subjects and you want to predict, whether the student is clever or not. Then in this case multinomial NB will work fine.
Consider a dataset that has weight of students and you are predicting height of them, then GaussiaNB will well in this case.
Bayes Classifier use probabilistic rules, the three ones you have mentioned related to the following rules:
Bayesian Probability: https://en.wikipedia.org/wiki/Bayesian_probability
Gaussian Distribution: https://en.wikipedia.org/wiki/Normal_distribution
Bernoulli Distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
Multinomial Distribution: https://en.wikipedia.org/wiki/Multinomial_distribution
You have to select the probability rule to use regarding the data you have (or try them all).
I think that what you have read on website or in research papers relates to the fact that email data usually follow a Bernoulli or Multinomial distribution. You can and I encourage you try with the Gaussian distribution, you should figure out very rapidly if you data can be fitted in a Gaussian distribution.
However, I would advise that you read the links above, you will have a better understanding of your work if you have a feeling of the reasons why the solution A or B works better than solution C.
Is there anything in scikit learn that can help me with the following?
I need a Bayesian network that is capable of taking continuous valued inputs and training against continuous valued targets. I then want to feed in new, previously unseen continuous inputs and receive estimates of the target values. Preferably with a way to measure confidence of the predictions. (PDFs perhaps?)
I am uncertain whether this would be considered a Naive Bayes Classifier or not.
I keep looking at GaussianNB but I just cannot see how it could be used in this way.
I'd like one that support "independence of irrelevant alternatives"
Any advice is greatly appreciated.
You are talking about regression, not classification. Naive Bayes Classifier is not a regression model. Check out numerous scikit-learn's regressors. IN particular, your could be interested in Bayesian Ridge Regression.