Classification using movie review corpus in NLTK/Python

Classification using movie review corpus in NLTK/Python - python

I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from the first part -- category creation based upon directory names. Some other questions on here have used filenames (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into.
from nltk.corpus import movie_reviews
reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
reviews.categories()
['pos', 'neg']
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
all_words=nltk.FreqDist(
w.lower()
for w in movie_reviews.words()
if w.lower() not in nltk.corpus.stopwords.words('english') and w.lower() not in string.punctuation)
word_features = all_words.keys()[:100]
def document_features(document):
document_words = set(document)
features = {}
for word in word_features:
features['contains(%s)' % word] = (word in document_words)
return features
print document_features(movie_reviews.words('pos/11.txt'))
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
This returns:
File "test.py", line 38, in <module>
for w in movie_reviews.words()
File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 184, in words
self, self._resolve(fileids, categories))
File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 91, in words
in self.abspaths(fileids, True, True)])
File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/util.py", line 421, in concat
raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!
---------UPDATE-------------
Thanks alvas for your detailed answer! I have two questions, however.
Is it possible to grab the category from the filename as I was attempting to do? I was hoping to do it in the same vein as the review_pos.txt method, only grabbing the pos from the folder name rather than the file name.
I ran your code and am experiencing a syntax error on
train_set =[({i:(i in tokens) for i in word_features}, tag) for tokens,tag in
documents[:numtrain]]
test_set = [({i:(i in tokens) for i in
word_features}, tag) for tokens,tag in documents[numtrain:]]
with the carrot under the first for. I'm a beginner Python user and I'm not familiar enough with that bit of syntax to try to toubleshoot it.
----UPDATE 2----
Error is
File "review.py", line 17
for i in word_features}, tag)
^
SyntaxError: invalid syntax`

Yes, the tutorial on chapter 6 is aim for a basic knowledge for students and from there, the students should build on it by exploring what's available in NLTK and what's not. So let's go through the problems one at a time.
Firstly, the way to get 'pos' / 'neg' documents through the directory is most probably the right thing to do, since the corpus was organized that way.
from nltk.corpus import movie_reviews as mr
from collections import defaultdict
documents = defaultdict(list)
for i in mr.fileids():
documents[i.split('/')[0]].append(i)
print documents['pos'][:10] # first ten pos reviews.
print
print documents['neg'][:10] # first ten neg reviews.
[out]:
['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt', 'pos/cv009_29592.txt']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']
Alternatively, I like a list of tuples where the first is element is the list of words in the .txt file and second is the category. And while doing so also remove the stopwords and punctuations:
from nltk.corpus import movie_reviews as mr
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
Next is the error at FreqDist(for w in movie_reviews.words() ...). There is nothing wrong with your code, just that you should try to use namespace (see http://en.wikipedia.org/wiki/Namespace#Use_in_common_languages). The following code:
from nltk.corpus import movie_reviews as mr
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string
stop = stopwords.words('english')
all_words = FreqDist(w.lower() for w in mr.words() if w.lower() not in stop and w.lower() not in string.punctuation)
print all_words
[outputs]:
<FreqDist: 'film': 9517, 'one': 5852, 'movie': 5771, 'like': 3690, 'even': 2565, 'good': 2411, 'time': 2411, 'story': 2169, 'would': 2109, 'much': 2049, ...>
Since the above code prints the FreqDist correctly, the error seems like you do not have the files in nltk_data/ directory.
The fact that you have fic/11.txt suggests that you're using some older version of the NLTK or NLTK corpora. Normally the fileids in movie_reviews, starts with either pos/neg then a slash then the filename and finally .txt , e.g. pos/cv001_18431.txt.
So I think, maybe you should redownload the files with:
$ python
>>> import nltk
>>> nltk.download()
Then make sure that the movie review corpus is properly downloaded under the corpora tab:
Back to the code, looping through all the words in the movie review corpus seems redundant if you already have all the words filtered in your documents, so i would rather do this to extract all featureset:
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
Next, splitting the train/test by features is okay but i think it's better to use documents, so instead of this:
featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
I would recommend this instead:
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
Then feed the data into the classifier and voila! So here's the code without the comments and walkthrough:
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
[out]:
0.655
Most Informative Features
bad = True neg : pos = 2.0 : 1.0
script = True neg : pos = 1.5 : 1.0
world = True pos : neg = 1.5 : 1.0
nothing = True neg : pos = 1.5 : 1.0
bad = False pos : neg = 1.5 : 1.0

Related

How to save a scikit-learn classifier that utilizes a vectorizer, a pipeline and GridSearchV?

I built a sentiment classifier using the following steps:
load dataset with pandas
count = CountVectorizer()
bag = count.fit_transform(x)
bag.toarray()
tfidf = TfidfTransformer(use_idf=True, norm="l2",smooth_idf=True)
tfidf.fit_transform(bag).toarray()
from collections import Counter
vocab = Counter()
for text in x:
for word in text.split(" "):
vocab[word] += 1
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
vocab_reduced = Counter()
for w, c in vocab.items():
if not w in stop:
vocab_reduced[w]=c
def preprocessor(text):
""" Return a cleaned version of text
"""
# Remove HTML markup
text = re.sub('<[^>]*>', '', text)
# Save emoticons for later appending
emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
# Remove any non-word character and append the emoticons,
# removing the nose character for standarization. Convert to lower case
text = (re.sub('[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
return text
from nltk.stem import PorterStemmer
porter = PorterStemmer()
def tokenizer(text):
return text.split()
def tokenizer_porter(text):
return [porter.stem(word) for word in text.split()]
tfidf = TfidfVectorizer(strip_accents=None,
lowercase=False,
preprocessor=None)
param_grid = [{'vect__ngram_range': [(1, 1)],
'vect__stop_words': [stop, None],
'vect__tokenizer': [tokenizer, tokenizer_porter],
'vect__preprocessor': [None, preprocessor],
'vect__use_idf':[False],
'vect__norm':[None],
"clf__alpha":[0,1],
"clf__fit_prior":[False,True]},
]
multi_tfidf = Pipeline([("vect", tfidf),
( "clf", MultinomialNB())])
gs_multi_tfidf = GridSearchCV(multi_tfidf, param_grid,
scoring="accuracy",
cv=5,
verbose=1,
n_jobs=-1)
gs_multi_tfidf.fit(X_train,y_train)
I tried saving the pipeline with joblib and saving both, the classifier as well as the pipeline to then use it for a website. But everytime I tried, it didn't work. I either got: ValueError: not enough values to unpack (expected 2, got 1) (when having saved both the pipeline and the classifier) or TypeError: 'module' object is not callable when just using the classifier.

Please try to use the following. Any specific reason why you do not include CountVectorizer() and TfidfTransformer() ? You should also exactly specify how you attempted to save the model.
multi_tfidf = Pipeline([("vect", TfidfVectorizer()),
( "clf", MultinomialNB())])

How to remove an entire line if it does not have a pos tag like CD?

I am reading a news article and pos-tagging with nltk. I want to remove those lines that does not have a pos tag like CD (numbers).
import io
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
stop_words = set(stopwords.words('english'))
file1 = open("etorg.txt")
line = file1.read()
file1.close()
print(line)
words = line.split()
tokens = nltk.pos_tag(words)
How do I remove all sentences that do not contain the CD tag?

Just use [word for word in tokens if word[1] != 'CD']
EDIT: To get the sentences that have no numbers, use this code:
def has_number(sentence):
for i in nltk.pos_tag(sentence.split()):
if i[1] == 'CD':
return ''
return sentence
line = 'MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. '
''.join([has_number(x) for x in line.split('.')])
> ' However, industry sources do not confirm this data '

Train customized corpus, with NLTK for Python

I try to train a corpus with my own documents. My documents are structured in the same way as the original movie_reviews corpus data, so 1K positive text files in folder 'pos' and 1K negative text files in folder 'neg'. Each textfile contains 25 lines of tweets, which are cleaned, as in: urls, usernames, capital letters, punctuation removed.
How can I adjust this code to use my own text data instead of the movie_reviews?
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from collections import defaultdict
import numpy as np
# define the split of % training / % test
SPLIT = 0.8
def word_feats(words):
return dict([(word, True) for word in words])
posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
cutoff = int(len(posfeats) * SPLIT)
trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]
print 'Train on %d instances\nTest on %d instances' % (len(trainfeats),len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
print 'Accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()

You can login as a root user and change you directory path to this:
/usr/local/lib/python2.7/dist-packages/nltk/corpus/__init__.py
In this document you can find already existing movie_reviews corpora loaded using LazyCorpusLoader:
movie_reviews = LazyCorpusLoader(
'movie_reviews', CategorizedPlaintextCorpusReader,
r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Then try adding some thing similar to this:
My_Movie = LazyCorpusLoader(
'My_Movie', CategorizedPlaintextCorpusReader,
r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Where My_Movie is the name which you have created for your movie reviews.
Once Everything is done save and exit.
Finally place you corpus in nltk directory where you can find the movie_review corpus.
Try performing this:
from nltk.corpus import My_Movie # Newly created you own corpus
Hope this will work.

NLTK stopword removal issue

I'm trying to do a document classification, as described in NLTK Chapter 6, and I'm having trouble removing stopwords. When I add
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))
it returns
Traceback (most recent call last):
File "fiction.py", line 8, in <module>
word_features = all_words.keys()[:100]
AttributeError: 'generator' object has no attribute 'keys'
I'm guessing that the stopword code changed the type of object used for 'all_words', rendering they .key() function useless. How can I remove stopwords before using the key function without changing its type? Full code below:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = './nltk_data/corpora/fiction'
fiction = PlaintextCorpusReader(corpus_root, '.*')
all_words=nltk.FreqDist(w.lower() for w in fiction.words())
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))
word_features = all_words.keys()[:100]
def document_features(document): # [_document-classify-extractor]
document_words = set(document) # [_document-classify-set]
features = {}
for word in word_features:
features['contains(%s)' % word] = (word in document_words)
return features
print document_features(fiction.words('fic/11.txt'))

I would do this by avoiding adding them to the FreqDist instance in the first place:
all_words=nltk.FreqDist(w.lower() for w in fiction.words() if w.lower() not in nltk.corpus.stopwords.words('english'))
Depending on the size of your corpus I think you'd probably get a performance boost out of creating a set for the stopwords before doing that:
stopword_set = frozenset(ntlk.corpus.stopwords.words('english'))
If that's not suitable for your situation, it looks like you can take advantage of the fact that FreqDist inherits from dict:
for stopword in nltk.corpus.stopwords.words('english'):
if stopword in all_words:
del all_words[stopword]

NLTK classify interface using trained classifier

I have this little chunk of code I found here:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
def word_feats(words):
return dict([(word, True) for word in words])
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))
classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()
But how can I classify a random word that might be in the corpus.
classifier.classify('magnificent')
Doesn't work. Does it need some kind of object?
Thank you very much.
EDIT: Thanks to #unutbu's feedback and some digging here and reading the comments on the original post the following yields 'pos' or 'neg' for this code (this one's a 'pos')
print(classifier.classify(word_feats(['magnificent'])))
and this yields the evaluation of the word for 'pos' or 'neg'
print(classifier.prob_classify(word_feats(['magnificent'])).prob('neg'))

print(classifier.classify(word_feats(['magnificent'])))
yields
pos
The classifier.classify method does not operate on individual words per se, it classifies based on a dict of features. In this example, word_feats maps a sentence (a list of words) to a dict of features.
Here is another example (from the NLTK book) which uses the NaiveBayesClassifier. By comparing what is similar and different between that example, and the one you posted, you may get a better perspective of how it can be used.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Classification using movie review corpus in NLTK/Python - python

Related

How to save a scikit-learn classifier that utilizes a vectorizer, a pipeline and GridSearchV?

How to remove an entire line if it does not have a pos tag like CD?

Train customized corpus, with NLTK for Python

NLTK stopword removal issue

NLTK classify interface using trained classifier

Categories

Resources