I've been using the maxent classifier in Python and it's failing, and I don't understand why.
I'm using the movie reviews corpus.
(total noob)
import nltk.classify.util
from nltk.classify import MaxentClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = MaxentClassifier.train(trainfeats)
This is the error (I know I'm doing this wrong; please link to an explanation of how Maxent works):
Warning (from warnings module):
File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1334
sum1 = numpy.sum(exp_nf_delta * A, axis=0)
RuntimeWarning: invalid value encountered in multiply
Warning (from warnings module):
File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1335
sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
RuntimeWarning: invalid value encountered in multiply
Warning (from warnings module):
File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1341
deltas -= (ffreq_empirical - sum1) / -sum2
RuntimeWarning: invalid value encountered in divide
I changed and updated the code a bit.
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import MaxentClassifier
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
from sklearn import cross_validation
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

#classifier = nltk.MaxentClassifier.train(trainfeats)
algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
classifier = nltk.MaxentClassifier.train(trainfeats, algorithm, max_iter=3)
classifier.show_most_informative_features(10)
There's probably a fix for the numpy overflow issue, but since this is just a movie review classifier for learning NLTK / text classification (and you probably don't want training to take a long time anyway), I'll provide a simple workaround: just restrict the words used in feature sets.
You can find the 300 most commonly used words in all reviews like this (you can obviously raise that number if you want):
all_words = nltk.FreqDist(word for word in movie_reviews.words())
top_words = set(all_words.keys()[:300])
Then all you have to do is cross-reference top_words in your feature extractor for reviews. Also, as a suggestion, it's more efficient to use a dictionary comprehension than to convert a list of tuples to a dict. So this might look like:
def word_feats(words):
    return {word: True for word in words if word in top_words}
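(Two hedged footnotes, beyond the original answer. First, on NLTK 3 / Python 3, FreqDist.keys() is no longer sorted by frequency, so keys()[:300] would pick 300 arbitrary words; most_common is the stable equivalent. Second, if you do want to train on the full feature set, MaxentClassifier.train accepts an algorithm name and iteration cutoffs, and GIS is sometimes more numerically forgiving than the default IIS.)

# NLTK 3: FreqDist is a Counter subclass, so use most_common for frequency order.
all_words = nltk.FreqDist(word for word in movie_reviews.words())
top_words = set(word for word, count in all_words.most_common(300))

# Hedged alternative for the warnings themselves: a different training algorithm,
# with a small iteration cap so training stays quick.
classifier = MaxentClassifier.train(trainfeats, algorithm='GIS', max_iter=10)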
I have a dataframe
0 i only need uxy to hit 20 eod to make up for a...
1 oh this isn’t good
2 lads why is my account covered in more red ink...
3 i'm tempted to drop my last 800 into some stup...
4 the sell offs will continue until moral improves.
I want to apply NLP to each comment to identify whether it is positive or negative.
Here is what I have
import pandas as pd
import numpy as np
from nltk.corpus import movie_reviews
from random import shuffle
from nltk import FreqDist
from nltk.corpus import stopwords
import string
from nltk import NaiveBayesClassifier
from nltk import classify
from nltk.tokenize import word_tokenize
df = pd.read_csv("/home/yan/PycharmProjects/pythonProject/comments_binary.csv")
pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append(words)

neg_reviews = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append(words)

stopwords_english = stopwords.words('english')

def bag_of_words(words):
    words_clean = []
    for word in words:
        word = word.lower()
        if word not in stopwords_english and word not in string.punctuation:
            words_clean.append(word)
    words_dictionary = dict([word, True] for word in words_clean)
    return words_dictionary

# positive reviews feature set
pos_reviews_set = []
for words in pos_reviews:
    pos_reviews_set.append((bag_of_words(words), 'pos'))

# negative reviews feature set
neg_reviews_set = []
for words in neg_reviews:
    neg_reviews_set.append((bag_of_words(words), 'neg'))

shuffle(pos_reviews_set)
shuffle(neg_reviews_set)

test_set = pos_reviews_set[:200] + neg_reviews_set[:200]
train_set = pos_reviews_set[200:] + neg_reviews_set[200:]

classifier = NaiveBayesClassifier.train(train_set)
accuracy = classify.accuracy(classifier, test_set)

custom_review = "I am pretty sure that TSLA will hit 500 today after open"
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = bag_of_words(custom_review_tokens)
print (classifier.classify(custom_review_set)) # Output: pos
I am confused about how to apply the classifier to each row of text and create a separate column with a pos or neg label describing each comment.
I tried to create a function
def my_classification(x):
    return classifier.classify(x)

df["new_column"] = df["text"].apply(my_classification)
But it says AttributeError: 'str' object has no attribute 'copy'
I would highly appreciate your help.
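(A hedged note, since no answer is attached here: the AttributeError is consistent with classifier.classify receiving a raw string, while NLTK classifiers expect a feature dict like the one bag_of_words builds. A minimal sketch of the likely fix, reusing the question's own helpers:)

def my_classification(x):
    # Build the same kind of feature dict the classifier was trained on.
    return classifier.classify(bag_of_words(word_tokenize(x)))

df["new_column"] = df["text"].apply(my_classification)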
I'm trying to train a classifier on my own documents. My documents are structured the same way as the original movie_reviews corpus data, i.e. 1K positive text files in folder 'pos' and 1K negative text files in folder 'neg'. Each text file contains 25 lines of tweets, which are cleaned, as in: URLs, usernames, capital letters, and punctuation removed.
How can I adjust this code to use my own text data instead of the movie_reviews?
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from collections import defaultdict
import numpy as np

# define the split of % training / % test
SPLIT = 0.8

def word_feats(words):
    return dict([(word, True) for word in words])

posids = movie_reviews.fileids('pos')
negids = movie_reviews.fileids('neg')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

cutoff = int(len(posfeats) * SPLIT)
trainfeats = negfeats[:cutoff] + posfeats[:cutoff]
testfeats = negfeats[cutoff:] + posfeats[cutoff:]
print 'Train on %d instances\nTest on %d instances' % (len(trainfeats), len(testfeats))

classifier = NaiveBayesClassifier.train(trainfeats)
print 'Accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()
You can log in as a root user and change your directory path to this:
/usr/local/lib/python2.7/dist-packages/nltk/corpus/__init__.py
In this file you can find the existing movie_reviews corpus loaded using LazyCorpusLoader:
movie_reviews = LazyCorpusLoader(
    'movie_reviews', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Then try adding something similar to this:
My_Movie = LazyCorpusLoader(
    'My_Movie', CategorizedPlaintextCorpusReader,
    r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*')
Where My_Movie is the name you have chosen for your movie reviews corpus.
Once everything is done, save and exit.
Finally, place your corpus in the nltk_data directory, alongside the movie_reviews corpus.
Try performing this:
from nltk.corpus import My_Movie # your newly created corpus
Hope this will work.
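(A hedged alternative that avoids editing site-packages as root: instantiate CategorizedPlaintextCorpusReader directly and point it at your own folder; the path below is a placeholder.)

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# '/path/to/My_Movie' stands in for wherever your pos/ and neg/ folders live.
my_movie = CategorizedPlaintextCorpusReader('/path/to/My_Movie',
                                            r'(?!\.).*\.txt',
                                            cat_pattern=r'(neg|pos)/.*')
print my_movie.categories()  # ['neg', 'pos']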
I am doing a classification task on tweets (3 labels: pos, neg, neutral), for which I'm using Naive Bayes in NLTK. I'd like to add ngrams (bigrams) as well. I have tried adding them to the code, but I can't seem to figure out where to fit them in. At the moment it seems as if I'm "breaking" the code, no matter where I add the bigrams. Could anybody please help me out, or redirect me to a tutorial?
My code for unigrams follows. If you need any information on how the datasets look, I'd be happy to provide it.
import nltk
import csv
import random
import nltk.classify.util, nltk.metrics
import codecs
import re, math, collections, itertools
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.util import ngrams
from nltk import bigrams
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
stemmer = SnowballStemmer("english", ignore_stopwords = True)

stopset = set(stopwords.words('english'))
stopset.add('username')
stopset.add('url')
stopset.add('percentage')
stopset.add('number')
stopset.add('at_user')
stopset.add('AT_USER')
stopset.add('URL')
stopset.add('percentagenumber')

inpTweets = []
##with open('sanders.csv', 'r', 'utf-8') as f: #input sanders
##    reader = csv.reader(f, delimiter = ';')
##    for row in reader:
##        inpTweets.append((row))
reader = codecs.open('...sanders.csv', 'r', encoding='utf-8-sig') #input classified tweets
for line in reader:
    line = line.rstrip()
    row = line.split(';')
    inpTweets.append((row))

def processTweet(tweet):
    tweet = tweet.lower()
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','URL',tweet)
    tweet = re.sub('@[^\s]+','AT_USER',tweet)   # replace @mentions with AT_USER
    tweet = re.sub('[\s]+', ' ', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # strip '#' from hashtags but keep the text
    tweet = tweet.strip('\'"')
    return tweet

def replaceTwoOrMore(s):
    # look for 2 or more repetitions of a character and collapse them to two
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)

def preprocessing(doc):
    tokens = tokenizer.tokenize(doc)
    bla = []
    for x in tokens:
        if len(x) > 2:
            if x not in stopset:
                val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", x)
                if val is not None:
                    x = replaceTwoOrMore(x)
                    x = processTweet(x)
                    x = x.strip('\'"?,.')
                    x = stemmer.stem(x).lower()
                    bla.append(x)
    return bla

xyz = []
for lijn in inpTweets:
    xyz.append((preprocessing(lijn[0]), lijn[1]))
random.shuffle(xyz)

featureList = []
for tokens, label in xyz:
    featureList.extend(tokens)

fd = nltk.FreqDist(featureList)
featureList = list(fd.keys())[2000:]

def document_features(doc):
    features = {}
    document_words = set(doc)
    for word in featureList:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = nltk.classify.util.apply_features(document_features, xyz)
training_set, test_set = featuresets[2000:], featuresets[:2000]
classifier = nltk.NaiveBayesClassifier.train(training_set)
Your code selects its classification features from the word frequency list. Just select the bigrams you want to use, and convert them to features in document_features(). A feature like "contains(the dog)" will work just like "contains(dog)".
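(A hedged sketch of that idea, reusing the question's own variable names; most_common and the 1000-bigram cutoff are arbitrary illustrative choices:)

from nltk import bigrams, FreqDist

# Count bigrams over the same preprocessed tweets; xyz holds (tokens, label) pairs.
bigram_fd = FreqDist(bg for tokens, label in xyz for bg in bigrams(tokens))
featureBigrams = [bg for bg, count in bigram_fd.most_common(1000)]

def document_features(doc):
    features = {}
    document_words = set(doc)
    document_bigrams = set(bigrams(doc))
    for word in featureList:
        features['contains(%s)' % word] = (word in document_words)
    for bg in featureBigrams:
        features['contains(%s %s)' % bg] = (bg in document_bigrams)
    return features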
An interesting approach is using a sequential backoff tagger, which allows you to chain taggers together: this way you could train an n-gram tagger and a Naive Bayes classifier and chain them together.
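(For what backoff chaining looks like in NLTK, a minimal sketch using POS tagging on the Brown corpus as a stand-in task; the corpus choice and split are illustrative only:)

from nltk.corpus import brown
from nltk.tag import UnigramTagger, BigramTagger

train_sents = brown.tagged_sents(categories='news')[:3000]
unigram = UnigramTagger(train_sents)
# The bigram tagger falls back to the unigram tagger when it lacks evidence.
bigram = BigramTagger(train_sents, backoff=unigram)
print(bigram.tag(['the', 'quick', 'brown', 'fox']))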
I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from the first part: category creation based upon directory names. Some other questions here have used filenames (e.g. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into.
import nltk
import string
from nltk.corpus import movie_reviews
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
reviews.categories()
# ['pos', 'neg']

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = nltk.FreqDist(
    w.lower()
    for w in movie_reviews.words()
    if w.lower() not in nltk.corpus.stopwords.words('english') and w.lower() not in string.punctuation)
word_features = all_words.keys()[:100]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print document_features(movie_reviews.words('pos/11.txt'))

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
This returns:
File "test.py", line 38, in <module>
for w in movie_reviews.words()
File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 184, in words
self, self._resolve(fileids, categories))
File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 91, in words
in self.abspaths(fileids, True, True)])
File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/util.py", line 421, in concat
raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!
---------UPDATE-------------
Thanks alvas for your detailed answer! I have two questions, however.
Is it possible to grab the category from the filename as I was attempting to do? I was hoping to do it in the same vein as the review_pos.txt method, only grabbing the pos from the folder name rather than the file name.
I ran your code and am experiencing a syntax error on

train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

with the caret under the first for. I'm a beginner Python user and I'm not familiar enough with that bit of syntax to try to troubleshoot it.
----UPDATE 2----
Error is

File "review.py", line 17
    for i in word_features}, tag)
    ^
SyntaxError: invalid syntax

(Note for later readers: the tracebacks above show Python 2.6, and dict comprehensions like {i: (i in tokens) for i in word_features} were only added in Python 2.7, which would explain this SyntaxError.)
Yes, the tutorial in chapter 6 is aimed at giving students basic knowledge, and from there, students should build on it by exploring what's available in NLTK and what's not. So let's go through the problems one at a time.
Firstly, getting the 'pos' / 'neg' documents through the directory is most probably the right thing to do, since the corpus was organized that way.
from nltk.corpus import movie_reviews as mr
from collections import defaultdict

documents = defaultdict(list)
for i in mr.fileids():
    documents[i.split('/')[0]].append(i)

print documents['pos'][:10] # first ten pos reviews.
print
print documents['neg'][:10] # first ten neg reviews.
[out]:
['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt', 'pos/cv009_29592.txt']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']
Alternatively, I like a list of tuples where the first element is the list of words in the .txt file and the second is the category. And while doing so, also remove the stopwords and punctuation:
from nltk.corpus import movie_reviews as mr
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
Next is the error at FreqDist(for w in movie_reviews.words() ...). There is nothing wrong with your code, just that you should try to use namespaces (see http://en.wikipedia.org/wiki/Namespace#Use_in_common_languages). The following code:
from nltk.corpus import movie_reviews as mr
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string
stop = stopwords.words('english')
all_words = FreqDist(w.lower() for w in mr.words() if w.lower() not in stop and w.lower() not in string.punctuation)
print all_words
[outputs]:
<FreqDist: 'film': 9517, 'one': 5852, 'movie': 5771, 'like': 3690, 'even': 2565, 'good': 2411, 'time': 2411, 'story': 2169, 'would': 2109, 'much': 2049, ...>
Since the above code prints the FreqDist correctly, the error suggests that you do not have the files in your nltk_data/ directory.
The fact that you have pos/11.txt suggests that you're using an older version of the NLTK or NLTK corpora. Normally the fileids in movie_reviews start with either pos/neg, then a slash, then the filename, and finally .txt, e.g. pos/cv001_18431.txt.
So I think maybe you should re-download the files with:
$ python
>>> import nltk
>>> nltk.download()
Then make sure that the movie review corpus is properly downloaded under the Corpora tab.
Back to the code: looping through all the words in the movie review corpus seems redundant if you already have all the words filtered in your documents, so I would rather do this to extract the full feature set:
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
Next, splitting the train/test by features is okay, but I think it's better to use documents, so instead of this:
featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
I would recommend this instead:
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
Then feed the data into the classifier and voila! So here's the code without the comments and walkthrough:
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
[out]:
0.655
Most Informative Features
     bad = True       neg : pos  =  2.0 : 1.0
  script = True       neg : pos  =  1.5 : 1.0
   world = True       pos : neg  =  1.5 : 1.0
 nothing = True       neg : pos  =  1.5 : 1.0
     bad = False      pos : neg  =  1.5 : 1.0
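(For readers on Python 3 and a current NLTK, a hedged port of the final script above; the only substantive change is that FreqDist.keys() is no longer frequency-sorted, so most_common(100) replaces keys()[:100].)

import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = set(stopwords.words('english'))
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation],
              i.split('/')[0]) for i in mr.fileids()]
word_features = [w for w, _ in FreqDist(chain(*[d for d, _ in documents])).most_common(100)]
numtrain = int(len(documents) * 90 / 100)
train_set = [({w: (w in tokens) for w in word_features}, tag) for tokens, tag in documents[:numtrain]]
test_set = [({w: (w in tokens) for w in word_features}, tag) for tokens, tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)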
I have this little chunk of code I found here:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats)*3/4
poscutoff = len(posfeats)*3/4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

classifier = NaiveBayesClassifier.train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
classifier.show_most_informative_features()
But how can I classify a random word that might be in the corpus?
classifier.classify('magnificent')
Doesn't work. Does it need some kind of object?
Thank you very much.
EDIT: Thanks to @unutbu's feedback, some digging here, and reading the comments on the original post, the following yields 'pos' or 'neg' for this code (this one's a 'pos'):
print(classifier.classify(word_feats(['magnificent'])))
and this yields the classifier's probability for 'neg':
print(classifier.prob_classify(word_feats(['magnificent'])).prob('neg'))
print(classifier.classify(word_feats(['magnificent'])))
yields
pos
The classifier.classify method does not operate on individual words per se; it classifies based on a dict of features. In this example, word_feats maps a sentence (a list of words) to a dict of features.
Here is another example (from the NLTK book) which uses the NaiveBayesClassifier. By comparing what is similar and different between that example and the one you posted, you may get a better perspective of how the classifier can be used.
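(To make the feature-dict point concrete, a minimal sketch assuming the classifier and word_feats defined in the code above: any list of words can be turned into a featureset the same way.)

review = "a magnificent, stunning piece of filmmaking"
feats = word_feats(review.lower().split())
print(classifier.classify(feats))                   # 'pos' or 'neg'
print(classifier.prob_classify(feats).prob('pos'))  # probability of the 'pos' label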