Is there anything in scikit-learn that can help me with the following?
I need a Bayesian network that is capable of taking continuous valued inputs and training against continuous valued targets. I then want to feed in new, previously unseen continuous inputs and receive estimates of the target values. Preferably with a way to measure confidence of the predictions. (PDFs perhaps?)
I am uncertain whether this would be considered a Naive Bayes Classifier or not.
I keep looking at GaussianNB but I just cannot see how it could be used in this way.
I'd like one that supports "independence of irrelevant alternatives".
Any advice is greatly appreciated.
You are talking about regression, not classification. A Naive Bayes classifier is not a regression model. Check out scikit-learn's many regressors. In particular, you could be interested in Bayesian Ridge Regression.
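As a minimal sketch of what that could look like, the example below fits scikit-learn's BayesianRidge on continuous inputs and asks for a standard deviation alongside each prediction, which is one way to get a confidence measure. The synthetic data and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic data (an assumption for illustration): the target depends
# linearly on two continuous inputs, plus a little Gaussian noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(200, 2))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(0.0, 0.1, size=200)

model = BayesianRidge()
model.fit(X_train, y_train)

# New, previously unseen continuous inputs.
X_new = np.array([[0.5, -0.5], [0.0, 0.0]])

# return_std=True also returns the standard deviation of the predictive
# distribution at each query point.
y_mean, y_std = model.predict(X_new, return_std=True)

for mean, std in zip(y_mean, y_std):
    print(f"prediction = {mean:.2f} +/- {std:.2f}")
```

The returned mean and standard deviation describe a Gaussian predictive distribution per input, so they can be turned into a PDF or an interval if needed.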
Related
I'm trying to make a model which can predict test scores. I'm currently using a simple linear regression model but receiving an accuracy score of close to 0 due to the fact that it's guessing a single number as the score. I was wondering if there was a way to have the model predict a range of about 10 numbers and if the true number is in that range it is marked as a correct guess.
The dataset I am using
GitHub page with notebook
It seems like you are using LogisticRegression. Despite its name, LogisticRegression is not for regression; it is for classification (for example, is the input data class a or b).
Use sklearn.linear_model.LinearRegression for linear regression; read this for more details.
There are also many other regression algorithms, too many to list in one answer. If you want to use regressors other than simple linear regression, read this for all the supervised learning algorithms scikit-learn provides. Ridge regression and SVR might be good places to start.
Sorry for some grammatical mistakes and misuse of words.
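For the "correct if within a range" idea from the question, one possible sketch is to keep the regressor as it is and score it with a tolerance instead of exact equality. The data below is synthetic (the real dataset is not shown here), and the 5-point tolerance and the feature name are assumptions chosen to match a range of about 10 numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the test-score data (an assumption).
rng = np.random.default_rng(0)
hours_studied = rng.uniform(0.0, 10.0, size=(300, 1))
scores = 40.0 + 5.0 * hours_studied[:, 0] + rng.normal(0.0, 4.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    hours_studied, scores, test_size=0.25, random_state=0
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Hypothetical tolerance: a prediction counts as "correct" if it is within
# 5 points of the true score, i.e. the truth falls in a 10-point range.
tolerance = 5.0
within_range = np.abs(y_pred - y_test) <= tolerance

print("fraction within range:", within_range.mean())
print("mean absolute error:", mean_absolute_error(y_test, y_pred))
```

Standard regression metrics such as mean absolute error are usually more informative than a hit/miss score, so both are printed here.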
I am currently working on text classification, trying to classify emails.
After my research, I found out that Multinomial Naive Bayes and Bernoulli Naive Bayes are more often used for text classification.
Bernoulli just cares about whether the word occurs or not.
Multinomial cares about the number of occurrences of the word.
Gaussian Naive Bayes is usually used for continuous data that is normally distributed, e.g. height, weight.
But what is the reason that we don't use Gaussian Naive Bayes for text classification?
Will anything bad happen if we apply it to text classification?
We choose the algorithm based on the kind of dataset we have:
Bernoulli Naive Bayes is good at handling boolean/binary attributes, Multinomial Naive Bayes is good at handling discrete values (such as counts), and Gaussian Naive Bayes is good at handling continuous values.
Consider three scenarios:
Consider a dataset which has columns like has_diabetes, has_bp, has_thyroid, and you classify the person as healthy or not. In such a scenario Bernoulli NB will work well.
Consider a dataset that has the marks of various students in various subjects, and you want to predict whether the student is clever or not. In this case Multinomial NB will work fine.
Consider a dataset that has the weights of students, and you are predicting a height class for them (Naive Bayes is a classifier, so the target must be categorical, e.g. short or tall). GaussianNB will work well in this case.
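To make the difference concrete for text, here is a small sketch that fits all three variants on word counts. The tiny corpus is made up for illustration, and the default settings of each estimator are used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Tiny made-up corpus (an assumption for illustration only).
emails = [
    "win money now",
    "free money offer win",
    "cheap offer free prize",
    "meeting agenda for monday",
    "project report and meeting notes",
    "lunch on monday",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Word counts per email; the result is a sparse matrix of non-negative integers.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(emails)

# MultinomialNB models how often each word occurs.
multinomial = MultinomialNB().fit(X_counts, labels)

# BernoulliNB only models presence/absence; by default it binarizes counts at 0.
bernoulli = BernoulliNB().fit(X_counts, labels)

# GaussianNB assumes each feature is normally distributed within a class and
# needs a dense array, which becomes expensive for large vocabularies.
gaussian = GaussianNB().fit(X_counts.toarray(), labels)

new_email = vectorizer.transform(["free prize money"])
print("multinomial:", multinomial.predict(new_email)[0])
print("bernoulli:  ", bernoulli.predict(new_email)[0])
print("gaussian:   ", gaussian.predict(new_email.toarray())[0])
```

Word counts are sparse, non-negative integers that are mostly zero, which is a poor match for a bell-shaped density. That mismatch, together with the dense-array requirement, is the practical reason GaussianNB is rarely the first choice for text.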
Bayes classifiers use probabilistic rules. The three you have mentioned relate to the following:
Bayesian Probability: https://en.wikipedia.org/wiki/Bayesian_probability
Gaussian Distribution: https://en.wikipedia.org/wiki/Normal_distribution
Bernoulli Distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
Multinomial Distribution: https://en.wikipedia.org/wiki/Multinomial_distribution
You have to select the probability distribution to use according to the data you have (or try them all).
I think that what you have read on websites or in research papers relates to the fact that email data usually follows a Bernoulli or Multinomial distribution. You can, and I encourage you to, try the Gaussian distribution; you should figure out very rapidly whether your data can be fitted by a Gaussian distribution.
However, I would advise that you read the links above. You will have a better understanding of your work if you have a feeling for why solution A or B works better than solution C.
I have a dataset which includes 200000 labelled training examples.
For each training example I have 10 features, including both continuous and discrete.
I'm trying to use the sklearn package for Python to train the model and make predictions, but I have some trouble (and some questions too).
First let me write the code which I have written so far:
from sklearn.naive_bayes import GaussianNB
# data contains the 200 000 examples
# targets contain the corresponding labels for each training example
gnb = GaussianNB()
gnb.fit(data, targets)
predicted = gnb.predict(data)
The problem is that I get really low accuracy (too many misclassified labels) - around 20%.
However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?
Any thoughts or suggestions will be much appreciated.
The problem is that I get really low accuracy (too many misclassified labels) - around 20%. However I am not quite sure whether there is a problem with the data (e.g. more data is needed or something else) or with the code.
This is not a surprising error rate for Naive Bayes. It is an extremely simple classifier and you should not expect it to be strong; more data probably won't help. Your Gaussian estimators are probably already very good; the naive independence assumptions are the problem. Use a stronger model. You can start with Random Forest, since it is very easy to use even for non-experts in the field.
Is this the proper way to implement a Naive Bayes classifier given a dataset with both discrete and continuous features?
No, it is not. You should use different distributions for the discrete features. However, scikit-learn does not support mixing distributions within a single Naive Bayes model, so you would have to do this manually. As said before, change your model.
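If you do want to try the manual route, one possible sketch is to fit a separate Naive Bayes model per feature type and combine their log-posteriors, subtracting the class prior once because both models include it. The data below is synthetic, and the use of CategoricalNB for the discrete columns is an assumption; this is an illustration of the idea, not a built-in scikit-learn feature.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic data (an assumption): two continuous and two binary features
# whose distributions depend on the class label y.
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X_cont = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))
X_disc = (rng.random((n, 2)) < (0.3 + 0.4 * y[:, None])).astype(int)

# One model per feature type.
gnb = GaussianNB().fit(X_cont, y)
cnb = CategoricalNB().fit(X_disc, y)

# Under the naive independence assumption:
#   log P(y | x) = log P(y | x_cont) + log P(y | x_disc) - log P(y) + const
# Each model's posterior already contains the class prior, so it is
# subtracted once to avoid counting it twice.
log_prior = np.log(gnb.class_prior_)
log_joint = (
    gnb.predict_log_proba(X_cont)
    + cnb.predict_log_proba(X_disc)
    - log_prior
)
combined_pred = gnb.classes_[np.argmax(log_joint, axis=1)]

print("combined accuracy (on training data):", (combined_pred == y).mean())
```

The scores in log_joint are unnormalized, which is fine for picking the most likely class; they would need to be normalized per row before being read as probabilities.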
Furthermore, in Machine Learning we know that the dataset should be split into training and validation/testing sets. Is this automatically performed by sklearn or should I fit the model using the training dataset and then call predict using the validation set?
Nothing is done automatically in this respect; you need to do this on your own (scikit-learn has lots of tools for that; see the cross-validation tools in sklearn.model_selection).
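A minimal sketch of doing the split yourself, and of comparing Naive Bayes against a Random Forest on held-out data, might look like this. The dataset is generated with make_classification as a stand-in for the real 10-feature data, so the numbers it prints say nothing about the original problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the 10-feature dataset (an assumption).
data, targets = make_classification(
    n_samples=2000, n_features=10, n_informative=6, random_state=0
)

# Hold out 25% of the examples; the models never see them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.25, random_state=0, stratify=targets
)

for name, model in [
    ("GaussianNB", GaussianNB()),
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=0)),
]:
    model.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # 5-fold cross-validation on the training portion only.
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    print(f"{name}: held-out accuracy = {test_acc:.3f}, 5-fold CV = {cv_acc:.3f}")
```

Predicting on the same data the model was fitted on, as in the question's code, tends to overstate accuracy, so the held-out figure is the one to trust.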
I'm using scikit-learn in Python and I want to use BayesianRidge regression to predict a continuous-valued target from my continuous inputs. My problem is that I also have a series of binary/categorical inputs and I don't know whether I should still use the BayesianRidge regressor.
If I supply the values as 0 or 1 (or -1, 0, 1) to the BayesianRidge regression, will I get good results? Or is there a better way to do this?
I'm still new to machine learning and I have to admit I find the scikit-learn documentation overwhelming.
I saw this question regarding a Naive Bayes Classifier, is there a similar approach for Bayesian Ridge Regression?
Mixing categorial and continuous data in Naive Bayes classifier using scikit-learn
In NLTK, if I write a NaiveBayes classifier for, say, movie reviews (determining whether each is positive or negative), how can I determine the classifier's "certainty" when classifying a particular review? That is, I know how to run an 'accuracy' test on a given test set to see the general accuracy of the classifier. But is there any way to have NLTK output its certainty? (Perhaps on the basis of the most informative features...)
Thanks
A
I am not sure about the NLTK implementation of Naive Bayes, but the Naive Bayes algorithm outputs probabilities of class membership. However, they are horribly calibrated.
If you want good measures of certainty, you should use a different classification algorithm. Logistic regression will do a decent job at producing calibrated estimates.
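As a rough illustration of per-prediction probabilities, the sketch below uses scikit-learn rather than NLTK (a swapped-in library, since the NLTK implementation is not covered here) and prints the class probabilities from a Naive Bayes model and from logistic regression for the same review. The six-review corpus is made up, and it is far too small to say anything about calibration itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up review corpus (an assumption for illustration only).
reviews = [
    "great film loved it",
    "wonderful acting great story",
    "loved the story",
    "terrible film hated it",
    "boring story awful acting",
    "hated the acting",
]
sentiments = ["pos", "pos", "pos", "neg", "neg", "neg"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# A mixed review; words not seen during fitting (such as "but") are ignored.
new_review = vectorizer.transform(["great story but awful acting"])

for name, clf in [
    ("naive Bayes", MultinomialNB()),
    ("logistic regression", LogisticRegression()),
]:
    clf.fit(X, sentiments)
    # predict_proba returns one probability per class, ordered as clf.classes_.
    probs = clf.predict_proba(new_review)[0]
    print(name, dict(zip(clf.classes_, probs.round(3))))
```

On real data, comparing these probabilities against observed frequencies on a held-out set is how you would check whether they are trustworthy.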
Try nltk.classify.util.log_likelihood. For this problem, you can also try measuring the results by precision, recall, and F-score at the token level, that is, scores for the positive and negative classes respectively.