NLTK certainty measure? - python

In NLTK, if I write a Naive Bayes classifier for, say, movie reviews (determining whether each review is positive or negative), how can I determine the classifier's "certainty" when classifying a particular review? That is, I know how to run an 'accuracy' test on a given test set to see the general accuracy of the classifier. But is there any way to have NLTK output its certainty? (Perhaps on the basis of the most informative features...)
Thanks

I am not sure about the NLTK implementation of Naive Bayes, but the Naive Bayes algorithm does output probabilities of class membership. However, they are horribly calibrated.
If you want good measures of certainty, you should use a different classification algorithm: logistic regression will do a decent job of producing calibrated estimates.
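If you do want the raw numbers NLTK assigns, NaiveBayesClassifier exposes prob_classify, which returns a probability distribution over the labels. A minimal sketch, bearing in mind the calibration caveat above (the featuresets here are placeholders; in practice they would come from your own feature extractor):

    from nltk.classify import NaiveBayesClassifier

    # Placeholder training data: (featureset, label) pairs.
    train = [
        ({"contains(great)": True, "contains(awful)": False}, "pos"),
        ({"contains(great)": False, "contains(awful)": True}, "neg"),
    ]
    classifier = NaiveBayesClassifier.train(train)

    # prob_classify returns a distribution over labels -- the classifier's
    # own "certainty" for this particular review.
    review = {"contains(great)": True, "contains(awful)": False}
    dist = classifier.prob_classify(review)
    for label in dist.samples():
        print(label, dist.prob(label))
    print("predicted:", dist.max())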

Have a look at nltk.classify.util.log_likelihood. For this problem you can also measure the results by precision, recall, and F-score per class, that is, scores for positive and negative respectively (see the sketch below).
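A common way to compute those per-class scores with nltk.metrics is to collect, for each label, the set of document indices carrying that label in the gold data and in the classifier's output. A sketch with placeholder data:

    import collections
    from nltk.classify import NaiveBayesClassifier
    from nltk.metrics.scores import precision, recall, f_measure

    # Placeholder featuresets; substitute your real train/test data.
    train = [({"contains(great)": True}, "pos"), ({"contains(awful)": True}, "neg")]
    test_data = [({"contains(great)": True}, "pos"), ({"contains(awful)": True}, "neg")]
    classifier = NaiveBayesClassifier.train(train)

    # Map each label to the indices that carry it in the reference
    # data and in the classifier's predictions respectively.
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(test_data):
        refsets[label].add(i)
        testsets[classifier.classify(feats)].add(i)

    for label in ("pos", "neg"):
        print(label,
              "precision:", precision(refsets[label], testsets[label]),
              "recall:", recall(refsets[label], testsets[label]),
              "F-score:", f_measure(refsets[label], testsets[label]))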

Related

Imbalanced Dataset - Binary Classification Python

I am trying to create a binary classification model for an imbalanced dataset (class 0: 84K samples, class 1: 16K samples) using a Random Forest. I have tried class_weight = 'balanced', class_weight = {0: 1, 1: 5}, downsampling, and oversampling, but none of these seem to work. My metrics are usually in the range below:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks
There are lots of ways to improve classifier behaviour. If you think your data are balanced (or rather, that your weighting method balances them well enough), then consider expanding your forest, either with deeper trees or more numerous trees (see the sketch below).
Also try other methods such as SVMs or ANNs, and see how they compare.
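A minimal sketch of the expanded forest; the make_classification call is just a synthetic stand-in for the 84K/16K data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in for the imbalanced dataset, for illustration only.
    X, y = make_classification(n_samples=10_000, weights=[0.84], random_state=42)

    clf = RandomForestClassifier(
        n_estimators=500,        # a more numerous forest
        max_depth=None,          # let the trees grow deep (the default)
        class_weight="balanced", # the reweighting already tried above
        n_jobs=-1,
        random_state=42,
    ).fit(X, y)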
Try stratified sampling for the dataset, so that a constant class ratio is maintained in both the training and the test set, and then apply the balanced class weights you have already used. If you want to improve accuracy, there are plenty of other options:
1) First, be sure that the dataset you have been given is accurate and verified.
2) You can play with the probability threshold: in binary classification, only make a prediction when the classifier is confident (say, probability > 0.7), and otherwise abstain. The drawback of this approach is null values, i.e. many instances not being predicted because the algorithm is not confident enough; but for a business model it can be a good approach, because people prefer a model that makes fewer confident-but-wrong predictions (see the sketch after this list).
3) Use stratified sampling to divide the training and test sets so that the constant class ratio is preserved. Rather than a plain train_test_split, stratified sampling returns the indices for training and testing, and you can combine it with cross-validation over different iterations.
4) In the confusion matrix, have a look at the precision score per class and see which class dominates (I believe applying the threshold limitation from point 2 would solve this).
5) Try other classifiers: logistic regression, SVM (linear or with another kernel, i.e. LinearSVC or SVC), or Naive Bayes. In most binary classification cases, logistic regression and SVC seem to perform ahead of the other algorithms, so try those approaches first.
6) Make sure to search for the best fitting parameters, i.e. hyperparameters (using grid search over a couple of learning rates, different kernels, class weights, or other parameters). If it is text classification, are you applying CountVectorizer with TF-IDF (and have you played with max_df and stop-word removal)?
If you have tried all of these, then go back and question whether the algorithm itself is the right choice.
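A hedged sketch of points 2) and 3) together, again on a synthetic stand-in for the real data; the 0.7 cut-off is just the example value from above:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=10_000, weights=[0.84], random_state=0)

    # stratify=y keeps the class ratio constant across train and test (point 3).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    clf = RandomForestClassifier(class_weight="balanced", random_state=0)
    clf.fit(X_train, y_train)

    # Point 2: only predict when the model is confident; -1 marks "no prediction".
    proba = clf.predict_proba(X_test)[:, 1]
    preds = np.full(len(proba), -1)
    preds[proba > 0.7] = 1
    preds[proba < 0.3] = 0
    print("abstained on", np.sum(preds == -1), "of", len(preds), "samples")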

Difference of three Naive Bayes classifiers

Sorry for some grammatical mistakes and misused words.
I am currently working on text classification, trying to classify emails.
In my research I found that Multinomial Naive Bayes and Bernoulli Naive Bayes are the variants most often used for text classification.
Bernoulli just cares about whether a word occurs or not.
Multinomial cares about the number of occurrences of a word.
Gaussian Naive Bayes is usually used for continuous data, and data with a normal distribution, e.g. height and weight.
But what is the reason that we don't use Gaussian Naive Bayes for text classification?
Any bad things will happen if we apply it to text classification?
We choose the algorithm based on the kind of dataset we have:
Bernoulli Naive Bayes is good at handling boolean/binary attributes, Multinomial Naive Bayes is good at handling discrete counts, and Gaussian Naive Bayes is good at handling continuous values.
Consider three scenarios:
Consider a dataset with columns like has_diabetes, has_bp, has_thyroid, from which you classify a person as healthy or not. In such a scenario, Bernoulli NB will work well.
Consider a dataset with the marks of various students in various subjects, from which you want to predict whether a student is clever or not. In this case, Multinomial NB will work fine.
Consider a dataset with the weights of students, from which you are predicting their heights. GaussianNB will work well in this case. (A minimal sketch of all three follows.)
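A minimal scikit-learn sketch of the three variants on made-up data matching those scenarios:

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

    # Binary attributes (has_diabetes, has_bp, has_thyroid) -> BernoulliNB
    X_bin = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]])
    y_bin = np.array([0, 1, 0, 1])
    print(BernoulliNB().fit(X_bin, y_bin).predict([[1, 0, 0]]))

    # Discrete counts (marks, word counts) -> MultinomialNB
    X_cnt = np.array([[3, 0, 1], [0, 2, 0], [2, 1, 3], [0, 4, 0]])
    y_cnt = np.array([0, 1, 0, 1])
    print(MultinomialNB().fit(X_cnt, y_cnt).predict([[1, 0, 2]]))

    # Continuous values (weight, height) -> GaussianNB
    X_con = np.array([[170.0, 65.0], [160.0, 55.0], [180.0, 80.0], [155.0, 50.0]])
    y_con = np.array([0, 1, 0, 1])
    print(GaussianNB().fit(X_con, y_con).predict([[175.0, 70.0]]))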
Bayes classifiers use probabilistic rules; the three you have mentioned relate to the following:
Bayesian Probability: https://en.wikipedia.org/wiki/Bayesian_probability
Gaussian Distribution: https://en.wikipedia.org/wiki/Normal_distribution
Bernoulli Distribution: https://en.wikipedia.org/wiki/Bernoulli_distribution
Multinomial Distribution: https://en.wikipedia.org/wiki/Multinomial_distribution
You have to select the probability distribution to use based on the data you have (or try them all).
I think what you have read on websites or in research papers relates to the fact that email data usually follow a Bernoulli or a Multinomial distribution. You can, and I encourage you to, try the Gaussian distribution as well; you should figure out very quickly whether your data can be fitted by a Gaussian.
However, I would advise you to read the links above: you will have a better understanding of your work if you have a feeling for the reasons why solution A or B works better than solution C.

Accuracy of lexicon-based sentiment analysis

I'm performing different sentiment analysis techniques on a set of Twitter data I have acquired. The techniques are lexicon-based (VADER Sentiment and SentiWordNet) and as such require no pre-labeled data.
I was wondering if there was a method (like F-Score, ROC/AUC) to calculate the accuracy of the classifier. Most of the methods I know require a target to compare the result to.
What I did for my research was take a small random sample of those tweets and manually label them as either positive or negative. You can then calculate the normalized scores using VADER or SentiWordNet, and compute the confusion matrix for each, which will give you your F-score and so on (see the sketch below).
This may not be a particularly good test, though, as it depends on the sample of tweets you use. For example, you may find that SentiWordNet classes more things as negative than VADER, and thus appears to have the higher accuracy if your random sample is mostly negative.
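A sketch of that workflow with NLTK's bundled VADER and scikit-learn metrics; the tweets and hand labels below are placeholders, and it requires nltk.download('vader_lexicon') once:

    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    from sklearn.metrics import confusion_matrix, f1_score

    # Hand-labelled random sample: 1 = positive, 0 = negative (placeholder data).
    tweets = ["I love this!", "Worst day ever.", "Pretty good overall."]
    gold = [1, 0, 1]

    sia = SentimentIntensityAnalyzer()
    # VADER's compound score lies in [-1, 1]; >= 0 is treated as positive here.
    pred = [1 if sia.polarity_scores(t)["compound"] >= 0 else 0 for t in tweets]

    print(confusion_matrix(gold, pred))
    print("F-score:", f1_score(gold, pred))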
The short answer is no, I don't think so. (So, I'd be very interested if someone else posts a method.)
With some unsupervised machine learning techniques you can get some measurement of error. E.g. an autoencoder gives you an MSE (representing how accurately the lower-dimensional representation can be reconstructed back to the original higher-dimensional form).
But for sentiment analysis, all I can think of is to use multiple algorithms and measure the agreement between them on the same data. Where they all agree on a particular sentiment, you mark it as a more reliable prediction; where they disagree, you mark it as unreliable (a small sketch follows below). This relies on none of the algorithms sharing the same biases, which is probably too much to hope for.
The usual approach is to label some percentage of your data, and assume/hope it is representative of the whole data.
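A rough sketch of the agreement idea, assuming you already have parallel pos/neg labels from two analysers (the labels below are made up):

    # Parallel labels from, say, VADER and SentiWordNet (placeholder values).
    vader_labels = ["pos", "neg", "pos", "neg"]
    swn_labels = ["pos", "pos", "pos", "neg"]

    reliable, unreliable = [], []
    for i, (a, b) in enumerate(zip(vader_labels, swn_labels)):
        (reliable if a == b else unreliable).append(i)

    agreement = len(reliable) / len(vader_labels)
    print(f"agreement: {agreement:.0%}, unreliable tweets: {unreliable}")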

Make graphviz from sklearn RandomForestClassifier (not from individual clf.estimators_)

Python. Sklearn. RandomForestClassifier. After fitting RandomForestClassifier, does it produce some kind of single "best" "averaged" consensus tree that could be used to create a graphviz?
Yes, I looked at the documentation. No, it doesn't say anything about this. No, RandomForestClassifier doesn't have a tree_ attribute. However, you can get the individual trees in the forest from clf.estimators_, so I know I could make a graphviz from one of those. There is an example of that here. I could even score all the trees and pick the one with the highest score in the forest... but that's not what I'm asking.
I want to make a graphviz from the "averaged" final random forest classifier result. Is this possible? Or, does the final classifier use the underlying trees to produce scores and predictions?
A RandomForest is an ensemble method that uses averaging to do prediction, i.e. all the fitted sub classifiers are used, typically (but not always) in a majority voting ensemble, to arrive at the final prediction. This is usually true for all ensemble methods. As Vivek Kumar points out in the comments, the prediction is not necessarily always a pure majority vote but can also be a weighted majority or indeed some other exotic form of combining the individual predictions (research on ensemble methods is ongoing although somewhat sidelined by deep learning).
There is no averaged tree that could be graphed, only the individual trees that were trained on random subsamples of the whole dataset and the predictions that each of those produces. It's the predictions themselves that are averaged, not the trees (a quick check of this follows below).
Just for completeness, from the wikipedia article: (emphasis mine)
Random forests or random decision forests [1][2] are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
mode being the most common value, in other words the majority prediction.
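In scikit-learn specifically, the forest averages the trees' class probabilities rather than taking a strict majority vote; a quick check of this on toy data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

    # The forest's probability is the mean of the individual trees'
    # probabilities; there is no single "averaged" tree to export.
    tree_mean = np.mean([t.predict_proba(X[:5]) for t in clf.estimators_], axis=0)
    print(np.allclose(tree_mean, clf.predict_proba(X[:5])))  # True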

Does scikit learn include a Naive Bayes classifier with continuous inputs?

Is there anything in scikit learn that can help me with the following?
I need a Bayesian network that is capable of taking continuous valued inputs and training against continuous valued targets. I then want to feed in new, previously unseen continuous inputs and receive estimates of the target values. Preferably with a way to measure confidence of the predictions. (PDFs perhaps?)
I am uncertain whether this would be considered a Naive Bayes Classifier or not.
I keep looking at GaussianNB but I just cannot see how it could be used in this way.
I'd like one that supports "independence of irrelevant alternatives".
Any advice is greatly appreciated.
You are talking about regression, not classification: a Naive Bayes classifier is not a regression model. Check out scikit-learn's numerous regressors; in particular, you could be interested in Bayesian Ridge Regression (see the sketch below).
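A minimal sketch on made-up continuous data; BayesianRidge's predict(..., return_std=True) returns a per-prediction standard deviation, which gives you the confidence measure you asked about:

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    # Toy continuous inputs and targets (placeholder data).
    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = 0.5 * X.ravel() + rng.normal(scale=0.3, size=100)

    reg = BayesianRidge().fit(X, y)

    # The standard deviation is the model's uncertainty for each prediction.
    mean, std = reg.predict(np.array([[1.0], [2.5]]), return_std=True)
    print(mean, std)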
