NLTK Naive Bayes rejecting everything

NLTK Naive Bayes rejecting everything - python

I am attempting to use the NLTK Naive Bayes classifier to guess which articles I will like on arXiv.org and indeed I got >85% accuracy from about 500 article titles. Unfortunately, I realized it was just classifying all the articles as 'bad'
On the bright side, it gives me a list of informative features that sort of makes sense.
I have searched on this site https://stackoverflow.com/search?q=nltk+naive+bayes and looked at NLTK's book as well as another tutorial (both deal with movies reviews from NLTK's own files).
Is there anything glaringly wrong with how I train this classifier? Hopefully the code snippet underneath is enough. Is it possible rejecting all titles correct for Naive Bayes?
def title_words(doc, vocab):
""" features for document classifier"""
words = set([x.lower() for x in doc[0].replace(". ", " ").split(' ')])
features = dict([( 'contains(%s)' % word , (word in words)) for word in vocab] )
return features
def(classify)
# get articles from database
conn = sqlite3.connect('arxiv.db')
c = conn.cursor()
# each row in the database is a 3-tuple: (title, abstract, tag)
articles = c.execute("select * from arXiv").fetchall()
# build vocabulary list
for the in articles:
vocab = vocab | set([x.lower() for x in the[0].split(' ')])
# get feature dictionary from title
titles = [(title_words(x, vocab),x[2]) for x in articles]
n = len(titles)/2
train_set, test_set = titles[:n], titles[n:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(20)
conn.close()
is there a problem if the number of features is around 2000? common words like "the" and "because" are unlikely distinguish between "interesting" and "boring". I imagine instead I am looking for specialized terms in different combinations, so I was hoping the classifier would pick out which specialized terms would induce me to read...

Related

Setting n-grams for sentiment analysis with Python and TextBlob

I want to do sentiment analysis of some sentences with Python and TextBlob lib.
I know how to use that, but Is there any way to set n-grams to that?
Basically, I do not want to analyze word by word, but I want to analyze 2 words, 3 words, because phrases can carry much more meaning and sentiment.
For example, this is what I have done (it works):
from textblob import TextBlob
my_string = "This product is very good, you should try it"
my_string = TextBlob(my_string)
sentiment = my_string.sentiment.polarity
subjectivity = my_string.sentiment.subjectivity
print(sentiment)
print(subjectivity)
But how can I apply, for example n-grams = 2, n-grams = 3 etc?
Is it possible to do that with TextBlob, or VaderSentiment lib?

Here is a solution that finds n-grams without using any libraries.
from textblob import TextBlob
def find_ngrams(n, input_sequence):
# Split sentence into tokens.
tokens = input_sequence.split()
ngrams = []
for i in range(len(tokens) - n + 1):
# Take n consecutive tokens in array.
ngram = tokens[i:i+n]
# Concatenate array items into string.
ngram = ' '.join(ngram)
ngrams.append(ngram)
return ngrams
if __name__ == '__main__':
my_string = "This product is very good, you should try it"
ngrams = find_ngrams(3, my_string)
analysis = {}
for ngram in ngrams:
blob = TextBlob(ngram)
print('Ngram: {}'.format(ngram))
print('Polarity: {}'.format(blob.sentiment.polarity))
print('Subjectivity: {}'.format(blob.sentiment.subjectivity))
To change the ngram lengths, change the n value in the function find_ngrams().

There is no parameter within textblob to define n-grams as opposed to words/unigrams to be used as features for sentiment analysis.
Textblob uses a polarity lexicon to calculate the overall sentiment of a text. This lexicon contains unigrams, which means it can only give you the sentiment of a word but not a n-gram with n>1.
I guess you could work around that by feeding bi- or tri-grams into the sentiment classifier, just like you would feed in a sentence and then create a dictionary of your n-grams with their accumulated sentiment value.
But I'm not sure that this is a good idea. I'm assuming you are looking for bigrams to address problems like negation ("not bad") and the lexicon approach won't be able to use not for flipping the sentiment value for bad.
Textblob also contains an option to use a naiveBayes classifier instead of the lexicon approach. This is trained on a movie review corpus provided by nltk but the default features for training are words/unigrams as far as I can make out from peeking at the source code.
You might be able to implement your own feature extractor within there to extract n-grams instead of words and then re-train it accordingly and use for your data.
Regardless of all that, I would suggest that you use a combination of unigrams and n>1-grams as features, because dropping unigrams entirely is likely to affect your performance negatively. Bigrams are much more sparsely distributed, so you'll struggle with data sparsity problems when training.

Text similarity with gensim and cosine similarity

from gensim import corpora, models, similarities
documents = ["This is a book about cars, dinosaurs, and fences"]
# remove common words and tokenize
stoplist = set('for a of the and to in - , is'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# Remove commas
texts[0] = [text.replace(',','') for text in texts[0]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "I like cars and birds"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi] # perform a similarity query against the corpus
print(sims)
In the above code I am comparing how much "This is a book about cars, dinosaurs, and fences" is similar to "I like cars and birds" using the cosine similarity technique.
The two sentences have effectively 1 words in common, which is "cars", however when I run the code I get that they are 100% similar. This does not make sense to me.
Can someone suggest how to improve my code so that I get a reasonable number?

These topic-modelling techniques need varied, realistic data to achieve sensible results. Toy-sized examples of just one or a few text examples don't work well – and even if they do, it's often just good luck or contrived suitability.
In particular:
a model with only one example can't sensibly create multiple topics, as there's no contrast-between-documents to model
a model presented with words it hasn't seen before ignores those words, so your test doc appears to it the same as the single word 'cars' – the only word it's seen before
In this case, both your single training document, and the test document, get modeled by LSI as having 0 contribution from the 0th topic, and positive contribution (of different magnitudes) from the 1st topic. Since cosine-similarity merely compares angle, and not magnitude, both docs are along-the-same-line-from-the-origin, and so have no angle-of-difference, and thus similarity 1.0.
But if you had better training data, and more than a single-known-word test doc, you might start to get more sensible results. Even a few dozen training docs, and a test doc with several known words, might help... but hundreds or thousands or tens-of-thousands training-docs would be even better.

Improve accuracy Naive Bayes Classifier

I wrote a simple document classifier and I am currently testing it on the Brown Corpus. However, my accuracy is still very low (0.16). I've already excluded stopwords. Any other ideas on how to improve the classifier's performance?
import nltk, random
from nltk.corpus import brown, stopwords
documents = [(list(brown.words(fileid)), category)
for category in brown.categories()
for fileid in brown.fileids(category)]
random.shuffle(documents)
stop = set(stopwords.words('english'))
all_words = nltk.FreqDist(w.lower() for w in brown.words() if w in stop)
word_features = list(all_words.keys())[:3000]
def document_features(document):
document_words = set(document)
features = {}
for word in word_features:
features['contains(%s)' % word] = (word in document_words)
return features
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

If that's really your code, it's a wonder you get anything at all. w.lower is not a string, it's a function (method) object. You need to add the parentheses:
>>> w = "The"
>>> w.lower
<built-in method lower of str object at 0x10231e8b8>
>>> w.lower()
'the'
(But who knows really. You need to fix the code in your question, it's full of cut-and-paste errors and who knows what else. Next time, help us help you better.)

I would start by changing the first comment from:
import corpus documents = [(list(brown.words(fileid)), category) to:
documents = [(list(brown.words(fileid)), category) ...
In addition to changing the w.lower as the other answer says.
Changing this and following these two links below which implements a basic Naive Classifier without removing stop words gave me an accuracy of 33% which is a lot higher than 16%.
https://pythonprogramming.net/words-as-features-nltk-tutorial/
https://pythonprogramming.net/naive-bayes-classifier-nltk-tutorial/?completed=/words-as-features-nltk-tutorial/
There are lots of things you can try to see if it improves your accuracy:
1- removing stop words
2- removing punctuation
3- removing the most common words and the least common words
4- normalizing the text
5- stemming or lemmatizing the text
6- I think this feature-set gives True if the word is present and False if it is not present. You can implement a count or a frequency.
7- You can use unigrams, bigrams and trigrams or combinations of those.
Hope that helped

How to extract rows with only meaningful text in a column

I have a large excel file like the following:
Timestamp Text Work Id
5/4/16 17:52 rain a lot the packs maybe damage. Delivery XYZ
5/4/16 18:29 wh. screen Other ABC
5/4/16 14:54 15107 Lane Pflugerville,
TX customer called me and his phone
number and my phone numbers were not
masked. thank you customer has had a
stroke and items were missing from his
delivery the cleaning supplies for his
wet vacuum steam cleaner. he needs a
call back from customer support Delivery YYY
5/6/16 13:05 How will I know if I Signing up ASX
5/4/16 23:07 an quality Delivery DFC
I want to work only on the "Text" column and then eliminate those row that have basically just have gibberish in the "Text" column (rows 2,4,5 from the above example).
I'm reading only the 2nd column as follow:
import xlrd
book = xlrd.open_workbook("excel.xlsx")
sheet = book.sheet_by_index(0)
for row_index in xrange(1, sheet.nrows): # skip heading row
timestamp, text = sheet.row_values(row_index, end_colx=2)
text)
print (text)
How do I remove the gibberish rows? I have an idea that I need to work with nltk and have a positive corpus (one that does not have any gibberish), one negative corpus (only having gibberish text), and train my model with it. But how do I go about implementing it? Please help!!

You can use nltk to do the following.
import nltk
english_words = set(w.lower() for w in nltk.corpus.words.words())
'a' in english_words
True
'dog' in english_words
True
'asdasdase' in english_words
False
How to get individual words in nltk from string:
individual_words_front_string = nltk.word_tokenize('This is my text from text column')
individual_words_front_string
['This', 'is,' 'my', 'text', 'from', 'text', 'column']
For each rows text column, test the individual words to see if they are in the english dictionary. If they all are, you know that rows text column us not gibberish.
If your definition of gibberish vs non-gibberish is different than english words found in nltk, you can use the same process above, just with a different list of acceptable words.
How to accept numbers and street addresses?
Simple way to determine if something is a number.
word = '32423432'
word.isdigit()
True
word = '32423432ds'
word.isdigit()
False
Addresses are more difficult. You can find info on that here:Parsing Addresses, and probably many other places. Of course you can always use the above logic if you have access to a list of cities, states, roads...etc.
Will it fail if any one word is False?
It's your code you decide. Perhaps you can mark something as gibberish if x% of words in the text are false?
How to determine if grammar is correct?
This is a bigger topic, and a more in-depth explanation can be found at the following link:
Checking Grammar. But the above answer will just check if words are in the nltk corpus, not whether or not the sentence is grammatically correct.

Separating good text from 'gibber' is not a trivial task, especially if you are dealing with text messages / chats (that's what it looks like to me).
A misspelled word does not make a sample unusable and even a syntactically wrong sentence should not disqualify the whole text. That's a standard you could use for newspaper texts, but not for raw, user generated content.
I would annotate a corpus in which you separate the good samples from the bad ones and train a simple classifier on in. Annotation does not have to be a big effort, since these gibberish texts are shorter than the good ones and should be easy to recognise (at least some). Also, you could try to start with a corpus size of ~100 datapoints (50 good / 50 bad) and expand it when the first model is more or less working.
This is a sample code that I always use for text classification. You need to install scikit-learn and numpy though:
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Prepare data
def prepare_data(data):
"""
data is expected to be a list of tuples of category and texts.
Returns a tuple of a list of lables and a list of texts
"""
random.shuffle(data)
return zip(*data)
# Format training data
training_data = [
("good", "rain a lot the packs maybe damage."),
("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support "),
("gibber", "wh. screen"),
("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)
# Format test set
test_data = [
("gibber", "an quality"),
("good", "<datapoint with valid text>",
# ...
]
test_labels, test_texts = prepare_data(test_data)
# Create feature vectors
"""
Convert a collection of text documents to a matrix of token counts.
See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
"""
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels
# Train the classifier
clf = LogisticRegression()
clf.fit(X, y)
# Test performance
X_test = vectorizer.transform(test_texts)
y_test = test_labels
# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)
# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))
# evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))
# predict labels for unknown texts
data = ["text1", "text2",]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle) always serialize
# classifier & vectorizer
X = vectorizer.transform(data)
# Now predict the labels for the texts in 'data'
labels = clf.predict(X)
# And put them back together
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]
A few words about how it works: The count vectorizer tokenizes the text and creates vectors containing the counts for all words in the corpus. Based upon these vectors, the classifier tries to recognise patters to distinguish between both categories. A text with only a few and uncommon (b/c misspelled) words would rather be in the 'gibber' category, while a text with a lot of words that are typical for common sentences (think of all the stop words here: 'I', 'you', 'is'... ) is more prone to be a good text.
If this method works for you, you should also try other classifiers and use the first model to semi-automatically annotate a larger training corpus.

Probability tree for sentences in nltk employing both lookahead and lookback dependencies

Does nltk or any other NLP tool allow to construct probability trees based on input sentences thus storing the language model of the input text in a dictionary tree, the following example gives the rough idea, but I need the same functionality such that a word Wt does not just probabilistically modelled on past input words(history) Wt-n but also on lookahead words like Wt+m. Also the lookback and lookahead word count should also be 2 or more i.e. bigrams or more. Are there any other libraries in python which achieve this?
from collections import defaultdict
import nltk
import math
ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
tokens = map(str.lower, nltk.word_tokenize(sentence))
for token, next_token in zip(tokens, tokens[1:]):
ngram[token][next_token] += 1
for token in ngram:
total = math.log10(sum(ngram[token].values()))
ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
the solution requires both lookahead and lookback and a specially sub classed dictionary may help in solving this problem. Can also point to relevant resources which talk about implementing such a system. A nltk.models seemed to be doing something similar but is no longer available. Are there any existing design patterns in NLP which implement this idea? skip gram based models are similar to this idea too but I feel this has should have been implemented already somewhere.

If I understand your question correctly, you are looking for a way to predict the probability of a word given its surrounding context (not just backward context but also the forward context).
One quick hack for your purpose is to train two different language models. One from right to left and the other from left to right and then probability of a word given its context would be the normalized sum of both forward and backward contexts.
Extending your code:
from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
ngram = defaultdict(lambda: defaultdict(int))
ngram_rev = defaultdict(lambda: defaultdict(int)) #reversed n-grams
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
tokens = map(str.lower, nltk.word_tokenize(sentence))
for token, next_token in zip(tokens, tokens[1:]):
ngram[token][next_token] += 1
for token, rev_token in zip(tokens[1:], tokens):
ngram_rev[token][rev_token] += 1
for token in ngram:
total = np.log(np.sum(ngram[token].values()))
total_rev = np.log(np.sum(ngram_rev[token].values()))
ngram[token] = {nxt: np.log(v) - total
for nxt, v in ngram[token].items()}
ngram_rev[token] = {prv: np.log(v) - total_rev
for prv, v in ngram_rev[token].items()}
Now the context is in both ngram and ngram_rev which respectively hold the forward and backward contexts.
You should also account for smoothing. That is if a given phrase is not seen in your training corpus, you would just get zero probabilities. In order to avoid that, there are many smoothing techniques the most simple of which is the add-on smoothing.

The normal ngram algorithm traditionally works with prior context only, and for good reason: A bigram tagger makes decisions by considering the tags of the last two words, plus the current word. So unless you tag in two passes, the tag of the next word is not yet known. But you are interested in word ngrams, not tag ngrams, so nothing keeps you from training an ngram tagger where the ngram consists of words from both sides. And you can indeed do it easily with the NLTK.
The NLTK's ngram taggers all make tag ngrams, from the left; but you can easily derive your own tagger from their abstract base class, ContextTagger:
import nltk
from nltk.tag import ContextTagger
class TwoSidedTagger(ContextTagger):
left = 2
right = 1
def context(self, tokens, index, history):
left = self.left
right = self.right
tokens = tuple(t.lower() for t in tokens)
if index < left:
tokens = ("<start>",) * left + tokens
index += left
if index + right >= len(tokens):
tokens = tokens + ("<end>",) * right
return tokens[index-left:index+right+1]
This defines a tetragram tagger (2+1+1) where the current word is third in the ngram, not last as usual. You can then initialize and train a tagger just like the regular ngram taggers (see chapter 5 of the NLTK book, especially sections 5.4ff). Let's see first how you'd build a part-of-speech tagger, using a portion of the Brown corpus as training data:
data = list(nltk.corpus.brown.tagged_sents(categories="news"))
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('NN'))
twosidedtagger._train(train_sents)
Like all ngram taggers in the NLTK, this one will delegate to the backoff tagger if it is asked to tag an ngram it did not see during training.
For simplicity I used a simple "default tagger" as the backoff tagger, but you'll probably need to use something more powerful (see the NLTK chapter again).
You can then use your tagger to tag new text, or evaluate it with an already tagged test set:
>>> print(twosidedtagger.tag("There were dogs everywhere .".split()))
>>> print(twosidedtagger.evaluate(test_sents))
Predicting words:
The above tagger assigns a POS tag by considering nearby words; but your goal is to predict the word itself, so you need different training data and a different default tagger. The NLTK API expects training data in the form (word, LABEL), where LABEL is the value you want to generate. In your case, LABEL is just the current word itself; so make your training data as follows:
data = [ zip(s,s) for s in nltk.corpus.brown.sents(categories="news") ]
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('the')) # most common word
twosidedtagger._train(train_sents)
It makes no sense for the target word to appear in the "context" ngram, so you should also modify the method context() so that the returned ngram does not include it:
def context(self, tokens, index, history):
...
return tokens[index-left:index] + tokens[index+1:index+right+1]
This tagger uses trigrams consisting of two words from the left and one from the right of the current word.
With these modifications, you'll build a tagger that outputs the most likely word at any position. Try it and how you like it.
Prediction:
My expectation is that you'll need a humongous amount of training data before you can get decent performance. The problem is that ngram taggers can only suggest a tag for contexts they saw during training.
To build a tagger that generalizes, consider using the NLTK to train a "sequential classifier". You can use whatever features you want, including the words before and after-- of course, how well it will work is your problem. The NLTK classifier API is similar to that for the ContextTagger, but the context function (aka feature function) returns a dictionary, not a tuple. Again, see the NLTK book and the source code.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.