I'm trying to do some sentiment analysis of tweets about a new movie using the NLTK toolkit. I've followed the NLTK 'movie_reviews' example and built my own CategorizedPlaintextCorpusReader object. The problem arises when I call nltk.classify.util.accuracy(classifier, testfeats). Here is the code:
import os
import glob
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
def word_feats(words):
    return dict([(word, True) for word in words])
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
trainfeats = negfeats + posfeats
# Building a custom Corpus Reader
tweets = nltk.corpus.reader.CategorizedPlaintextCorpusReader('./tweets', r'.*\.txt', cat_pattern=r'(.*)\.txt')
tweetsids = tweets.fileids()
testfeats = [(word_feats(tweets.words(fileids=[f]))) for f in tweetsids]
print 'Training the classifier'
classifier = NaiveBayesClassifier.train(trainfeats)
for tweet in tweetsids:
    print tweet + ' : ' + classifier.classify(word_feats(tweets.words(tweetsids)))
classifier.show_most_informative_features()
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
It all seems to work fine until it gets to the last line. That's when I get the error:
>>> nltk.classify.util.accuracy(classifier, testfeats)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/nltk/classify/util.py", line 87, in accuracy
results = classifier.classify_many([fs for (fs,l) in gold])
ValueError: too many values to unpack
Does anybody see anything wrong with the code?
Thanks.
The error message
File "/usr/lib/python2.7/dist-packages/nltk/classify/util.py", line 87, in accuracy
results = classifier.classify_many([fs for (fs,l) in gold])
ValueError: too many values to unpack
arises because the items in gold cannot be unpacked into a 2-tuple, (fs, l):
[fs for (fs,l) in gold] # <-- The ValueError is raised here
It is the same error you would get if gold equals [(1,2,3)], since the 3-tuple (1,2,3) cannot be unpacked into a 2-tuple (fs,l):
In [74]: [fs for (fs,l) in [(1,2)]]
Out[74]: [1]
In [73]: [fs for (fs,l) in [(1,2,3)]]
ValueError: too many values to unpack
gold might be buried inside the implementation of nltk.classify.util.accuracy, but this hints that your inputs, classifier or testfeats, are of the wrong "shape".
There is no problem with classifier, since calling accuracy(classifier, trainfeats) works:
In [61]: print 'accuracy:', nltk.classify.util.accuracy(classifier, trainfeats)
accuracy: 0.9675
The problem must be in testfeats.
Compare trainfeats with testfeats.
trainfeats[0] is a 2-tuple containing a dict and a classification:
In [63]: trainfeats[0]
Out[63]:
({u'!': True,
u'"': True,
u'&': True,
...
u'years': True,
u'you': True,
u'your': True},
'neg') # <--- Notice the classification, 'neg'
but testfeats[0] is just a dict, word_feats(tweets.words(fileids=[f])):
testfeats = [(word_feats(tweets.words(fileids=[f]))) for f in tweetsids]
So to fix this you would need to define testfeats to look more like trainfeats -- each dict returned by word_feats must be paired with a classification.
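For example, here is a minimal sketch, assuming a hypothetical known_labels dict that maps each tweet file id to its gold 'pos'/'neg' label (if the tweets are unlabeled you can classify them, but you cannot compute accuracy against them):
known_labels = {'tweet1.txt': 'pos', 'tweet2.txt': 'neg'}  # hypothetical gold labels

testfeats = [(word_feats(tweets.words(fileids=[f])), known_labels[f])
             for f in tweetsids if f in known_labels]
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)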
Related
I'm just trying to print the output of my script. I have this problem; I have researched and read many answers, and even adding .encode('utf-8') still does not work.
import pandas
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
n_components = 30
n_top_words = 10
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
    return message
text = pandas.read_csv('fr_pretraitement.csv', encoding = 'utf-8')
text_clean = text['liste2']
text_raw = text['liste1']
text_clean_non_empty = text_clean.dropna()
not_commas = text_raw.str.replace(',', '')
text_raw_list = not_commas.values.tolist()
text_clean_list = text_clean_non_empty.values.tolist()
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(text_clean_list)
tf_feature_names = tf_vectorizer.get_feature_names()
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)
print('topics...')
print(print_top_words(lda, tf_feature_names, n_top_words))
document_topics = lda.fit_transform(tf)
topics = print_top_words(lda, tf_feature_names, n_top_words)
for i in range(len(topics)):
    print("Topic {}:".format(i))
    docs = np.argsort(document_topics[:, i])[::-1]
    for j in docs[:300]:
        cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])
        print(cleans.encode('utf-8') + ',' + " ".join(text_raw_list[j].encode('utf-8').split(",")[:2]))
My output:
Traceback (most recent call last):
File "script.py", line 62, in
cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])
TypeError: a bytes-like object is required, not 'str'
Let's look at the line in which the error is raised:
cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])
Let's go step by step:
text_clean_list[j] is of str type => no error until there
text_clean_list[j].encode('utf-8') is of bytes type => no error until there
text_clean_list[j].encode('utf-8').split(",") is wrong: the parameter "," passed to split() method is of str type, but it must have been of bytes type (because here split() is a method from a bytes object) => the error is raised, indicating that a bytes-like object is required, not 'str'.
Note: Replacing split(",") with split(b",") avoids the error (but it may not be the behavior you expect...)
cleans = " ".join(text_clean_list[j].encode('utf-8').split(",")[:2])
You are encoding the string inside text_clean_list[j] into bytes, but what about the split(",")?
The "," is still a str, so now you are trying to split a bytes-like object using a string.
Example:
a = "this,that"
>>> a.encode('utf-8').split(',')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'
Edit
Solution:
1- One solution could be to not encode your string object right away: just split first and then encode later on, as in my example:
a = "this, that"
c = a.split(",")
cleans = [x.encode('utf-8') for x in c]
2- Just encode the "," separator itself:
cleans = a.encode("utf-8").split(b",")
Both yield the same result. It would be better if you could come up with input and output examples.
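For instance, here is a sketch of the first approach applied to the line from the question (assuming text_clean_list[j] and text_raw_list[j] are ordinary str values): split the str first and skip the encoding entirely, since print() wants str in Python 3.
# split the str values first; keep everything as str for printing
cleans = " ".join(text_clean_list[j].split(",")[:2])
raws = " ".join(text_raw_list[j].split(",")[:2])
print(cleans + ',' + raws)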
OK, so I'm trying to create a program that tells me how positive or negative each line of the paulryan.txt file is. I'm using the opinion_lexicon, and the file object is of type '_io.TextIOWrapper'.
Is there something I can use instead of .words?
Other, less important problem: any ideas on how to make my WHOLE paulryan.txt file lowercase while keeping it tokenized by line? I'm thinking it won't give me an accurate positive or negative score if I don't make the whole thing lowercase, because there are only lowercase words in the opinion_lexicon.
import nltk
from nltk.corpus import opinion_lexicon
from nltk.tokenize.simple import (LineTokenizer, line_tokenize)
poswords = set(opinion_lexicon.words("positive-words.txt"))
negwords = set(opinion_lexicon.words("negative-words.txt"))
f=open("paulryan.txt", "rU")
raw = f.read()
token= nltk.line_tokenize(raw)
print(token)
def finddemons():
    for x in token:
        y = token.words()
        percpos = len([w for w in token if w in poswords]) / len(y)
        percneg = len([w for w in token if w in negwords]) / len(y)
        print(x, "pos:", round(percpos, 3), "neg:", round(percneg, 3))
finddemons()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in finddemons
AttributeError: 'list' object has no attribute 'words'
I suggest you read the file line by line. Then, use word_tokenize:
for line in f:
    tokens = word_tokenize(line)
You are right about lowercasing the text before searching in the lexicon:
for line in f:
    tokens = word_tokenize(line.lower())
You could even try to lemmatize the tokens using WordNet, because the opinion lexicon is not that rich in vocabulary. Especially with tweets, words often appear in different forms.
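Putting those pieces together, here is a minimal sketch of the per-line scoring loop (reusing the file name and lexicon calls from the question; Python 3 division):
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize

poswords = set(opinion_lexicon.words("positive-words.txt"))
negwords = set(opinion_lexicon.words("negative-words.txt"))

with open("paulryan.txt") as f:
    for line in f:
        tokens = word_tokenize(line.lower())
        if not tokens:
            continue  # skip blank lines to avoid dividing by zero
        percpos = len([w for w in tokens if w in poswords]) / len(tokens)
        percneg = len([w for w in tokens if w in negwords]) / len(tokens)
        print(line.strip(), "pos:", round(percpos, 3), "neg:", round(percneg, 3))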
I am working on a project and I would like to use Latent Dirichlet Allocation in order to extract topics from a large amount of articles.
My code is this:
import gensim
import csv
import json
import glob
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from time import gmtime, strftime
tokenizer = RegexpTokenizer(r'\w+')
cachedStopWords = set(stopwords.words("english"))
body = []
processed = []
with open('/…/file.json') as j:
    data = json.load(j)
for i in range(0, len(data)):
    body.append(data[i]['text'].lower())
for entry in body:
    row = tokenizer.tokenize(entry)
    processed.append([word for word in row if word not in cachedStopWords])
dictionary = corpora.Dictionary(processed)
corpus = [dictionary.doc2bow(text) for text in processed]
lda = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50, update_every=1, passes=1)
topics = lda.show_topics(num_topics=50, num_words=8)
other_doc = "After being jailed for life in 1964, Nelson Mandela became a worldwide symbol of resistance to apartheid. But his opposition to racism began many years before."
print lda[other_doc]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/gensim/models/ldamodel.py", line 714, in __getitem__
gamma, _ = self.inference([bow])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site
packages/gensim/models/ldamodel.py", line 361, in inference ids = [id for id, _ in doc]
ValueError: need more than 1 value to unpack
I also tried to use LdaMulticore in 3 different ways :
lda = gensim.models.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
lda = gensim.models.ldamodel.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=100, workers=3)
And every time I got this error :
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'LdaMulticore'
Any ideas?
Thank you in advance.
You have to convert the new document to bag-of-words (vector) space first.
http://radimrehurek.com/gensim/tut3.html#similarity-interface
vec_bow = dictionary.doc2bow(other_doc.lower().split())
vec_lda = lda[vec_bow]  # convert the query to LDA space
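Applied to the code in the question, the failing line would then become something like this sketch (reusing the dictionary built from the training corpus):
other_bow = dictionary.doc2bow(other_doc.lower().split())
print lda[other_bow]  # topic distribution of the new document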
I realize this is old, but I just had this same problem. You are probably pointing to an older version of Gensim. You have to make sure you're using version >= 0.10.2.
Update with "easy_install -U gensim" and then make sure your IDE is seeing the updated library.
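A quick sanity check that your interpreter is actually picking up the updated library (a small sketch, not part of the original answer):
import gensim
print gensim.__version__                      # should be >= 0.10.2
print hasattr(gensim.models, 'LdaMulticore')  # should print True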
I tried the code from "Natural Language Processing with Python", but a TypeError occurred.
import nltk
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist.inc(word[-1:])
    suffix_fdist.inc(word[-2:])
    suffix_fdist.inc(word[-3:])
common_suffixes = suffix_fdist.items()[:100]
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features
pos_features('people')
the error is below:
Traceback (most recent call last):
File "/home/wanglan/javadevelop/TestPython/src/FirstModule.py", line 323, in <module>
pos_features('people')
File "/home/wanglan/javadevelop/TestPython/src/FirstModule.py", line 321, in pos_features
features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
TypeError: not all arguments converted during string formatting
Could anyone help me find out where I am wrong?
suffix is a tuple, because .items() returns (key, value) tuples. When you use %, if the right-hand side is a tuple, its values are unpacked and substituted for each % format in order. The error you get is complaining that the tuple has more entries than there are % formats.
You probably want just the key (the actual suffix), in which case you should use suffix[0], or .keys() to only retrieve the dictionary keys.
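For example, here is a minimal sketch of the fix (assuming NLTK 3, where FreqDist.inc() is gone and most_common() returns the items sorted by frequency):
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1   # FreqDist.inc() was removed in NLTK 3
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

# keep only the suffix strings, not the (suffix, count) pairs
common_suffixes = [suffix for suffix, count in suffix_fdist.most_common(100)]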
I'm looking to understand why using stemming and stop word removal results in worse performance in my naive Bayes classifier.
I have two files, positive and negative reviews, both of which have around 200 lines, but with many words per line, possibly 5000.
I have the following code that creates a bag of words; then I create two feature sets for training and testing and run them against the NLTK classifier:
word_features = list(all_words.keys())[:15000]
testing_set = featuresets[10000:]
training_set = featuresets[:10000]
nbclassifier = nltk.NaiveBayesClassifier.train(training_set)
print((nltk.classify.accuracy(nbclassifier, testing_set))*100)
nbclassifier.show_most_informative_features(30)
This produces around 45000 words and has an accuracy of 85%.
I've looked at adding stemming (PorterStemmer) and removing stop words in my training data, but when I run the classifier again I now get 205 words and 0% accuracy, and while testing other classifiers the script generates errors:
Traceback (most recent call last):
File "foo.py", line 108, in <module>
print((nltk.classify.accuracy(MNB_classifier, testing_set))*100)
File "/Library/Python/2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
results = classifier.classify_many([fs for (fs, l) in gold])
File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 83, in classify_many
X = self._vectorizer.transform(featuresets)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 293, in transform
return self._transform(X, fitting=False)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 184, in _transform
raise ValueError("Sample sequence X is empty.")
ValueError: Sample sequence X is empty.
I don't understand why adding stemming and/or removing stop words breaks the classifier.
Adding stemming or removing stop words could not cause your issue by itself. I think you have an issue further up in your code, in how you read the file. When I was following sentdex's tutorial on YouTube, I came across this same error. I was stuck for the past hour, but I finally got it. If you follow his code, you get this:
short_pos = open("short_reviews/positive.txt", "r").read()
short_neg = open("short_reviews/negative.txt", "r").read()
documents = []
for r in short_pos.split('\n'):
    documents.append((r, 'pos'))
for r in short_neg.split('\n'):
    documents.append((r, 'neg'))
all_words = []
short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)
for w in short_pos_words:
    all_words.append(w.lower())
for w in short_neg_words:
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]
I kept running into this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6056: invalid start byte.
You get this error because there are non UTF-8 characters in the files provided. I was able to get around the error by changing the code to this:
pos_lines = []
fname = 'short_reviews/positive.txt'
with open(fname, 'r', encoding='utf-16') as f:
    for line in f:
        pos_lines.append(line)
Unfortunately, then I started getting this error:
UnicodeError: UTF-16 stream does not start with BOM
I forget how, but I made this error go away too. Then I started getting the same error as your original question:
ValueError: Sample sequence X is empty.
When I printed the length of featuresets, I saw it was only 2.
print("Feature sets list length : ", len(featuresets))
After digging on this site, I found these two questions:
Delete every non utf-8 symbols froms string
'str' object has no attribute 'decode' in Python3
The first one didn't really help, but the second one solved my problem (Note: I'm using python-3).
I'm not one for one liners, but this worked for me:
pos_lines = [line.rstrip('\n') for line in open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1')]
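For completeness, here is a sketch of reading both review files that way before building documents (file names taken from the question; use whatever encoding actually matches your files):
# read both review files with an explicit encoding, then build (review, label) pairs
pos_lines = [line.rstrip('\n') for line in open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1')]
neg_lines = [line.rstrip('\n') for line in open('short_reviews/negative.txt', 'r', encoding='ISO-8859-1')]

documents = [(r, 'pos') for r in pos_lines] + [(r, 'neg') for r in neg_lines]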
I will update my github repo later this week with the full code for the nlp tutorial if you'd like to see the complete solution. I realize this answer probably comes 2 years too late, but hopefully it helps.