nltk stemming and stop words for naive bayes - python

I'm looking to understand why using stemming and removing stop words results in worse performance for my Naive Bayes classifier.
I have two files, positive and negative reviews, both of which have around 200 lines, but each line contains many words, possibly 5000 words per line.
I have the following code that creates a bag of words; I then create two feature sets for training and testing, and run them against the NLTK classifier:
word_features = list(all_words.keys())[:15000]
testing_set = featuresets[10000:]
training_set = featuresets[:10000]
nbclassifier = nltk.NaiveBayesClassifier.train(training_set)
print((nltk.classify.accuracy(nbclassifier, testing_set))*100)
nbclassifier.show_most_informative_features(30)
This produces around 45000 words and has an accuracy of 85%.
I've looked at adding stemming (PorterStemmer) and removing stop words in my training data, but when I run the classifier again I now get only 205 words and 0% accuracy, and while testing other classifiers the script generates errors:
Traceback (most recent call last):
File "foo.py", line 108, in <module>
print((nltk.classify.accuracy(MNB_classifier, testing_set))*100)
File "/Library/Python/2.7/site-packages/nltk/classify/util.py", line 87, in accuracy
results = classifier.classify_many([fs for (fs, l) in gold])
File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 83, in classify_many
X = self._vectorizer.transform(featuresets)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 293, in transform
return self._transform(X, fitting=False)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 184, in _transform
raise ValueError("Sample sequence X is empty.")
ValueError: Sample sequence X is empty.
I don't understand why adding stemming and/or removing stop words breaks the classifier.
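(For reference, the stemming and stop-word removal step looks roughly like this; it's a sketch using NLTK's PorterStemmer and English stop word list, since the exact preprocessing code isn't shown in the question.)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    # lowercase, drop stop words and non-alphabetic tokens, then stem
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(w) for w in tokens if w.isalpha() and w not in stop_words]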

Adding stemming or removing stop words could not cause your issue. I think you have an issue further up in your code due to how you read the file. When I was following sentdex's tutorial on YouTube, I came across this same error. I was stuck for the past hour, but I finally got it. If you follow his code you get this:
short_pos = open("short_reviews/positive.txt", "r").read()
short_neg = open("short_reviews/negative.txt", "r").read()

documents = []

for r in short_pos.split('\n'):
    documents.append((r, 'pos'))
for r in short_neg.split('\n'):
    documents.append((r, 'neg'))

all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())
for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:5000]
I kept running into this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 6056: invalid start byte.
You get this error because there are non-UTF-8 characters in the files provided. I was able to get around the error by changing the code to this:
pos_lines = []
fname = 'short_reviews/positive.txt'
with open(fname, 'r', encoding='utf-16') as f:
    for line in f:
        pos_lines.append(line)
Unfortunately, then I started getting this error:
UnicodeError: UTF-16 stream does not start with BOM
I forget how, but I made this error go away too. Then I started getting the same error as your original question:
ValueError: Sample sequence X is empty.
When I printed the length of featuresets, I saw it was only 2.
print("Feature sets list length : ", len(featuresets))
After digging on this site, I found these two questions:
Delete every non utf-8 symbols froms string
'str' object has no attribute 'decode' in Python3
The first one didn't really help, but the second one solved my problem (note: I'm using Python 3).
I'm not one for one liners, but this worked for me:
pos_lines = [line.rstrip('\n') for line in open('short_reviews/positive.txt', 'r', encoding='ISO-8859-1')]
I will update my GitHub repo later this week with the full code for the NLP tutorial if you'd like to see the complete solution. I realize this answer probably comes two years too late, but hopefully it helps.
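Putting the fix together, the file-reading part would look roughly like this (a sketch; load_lines is just a helper name I'm using here, and the paths are those from the question):
def load_lines(fname):
    # Read each review as one line, dropping the trailing newline.
    with open(fname, 'r', encoding='ISO-8859-1') as f:
        return [line.rstrip('\n') for line in f]

pos_lines = load_lines('short_reviews/positive.txt')
neg_lines = load_lines('short_reviews/negative.txt')

documents = [(r, 'pos') for r in pos_lines] + [(r, 'neg') for r in neg_lines]
print("Number of documents:", len(documents))  # should be in the hundreds or thousands, not 2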

Related

struct.error: unpack requires a string argument of length 12

I am trying to follow a tutorial from Coding Robin to create a Haar classifier: http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html.
I am at the part where I need to merge all the .vec files. I am trying to execute the python script given and I am getting the following error:
Traceback (most recent call last):
File "mergevec.py", line 170, in <module>
merge_vec_files(vec_directory, output_filename)
File "mergevec.py", line 133, in merge_vec_files
val = struct.unpack('<iihh', content[:12])
struct.error: unpack requires a string argument of length 12
Here is the code from the python script:
# Get the value for the first image size
prev_image_size = 0
try:
    with open(files[0], 'rb') as vecfile:
        content = ''.join(str(line) for line in vecfile.readlines())
        val = struct.unpack('<iihh', content[:12])
        prev_image_size = val[1]
except IOError as e:
    print('An IO error occured while processing the file: {0}'.format(f))
    exception_response(e)

# Get the total number of images
total_num_images = 0
for f in files:
    try:
        with open(f, 'rb') as vecfile:
            content = ''.join(str(line) for line in vecfile.readlines())
            val = struct.unpack('<iihh', content[:12])
            num_images = val[0]
            image_size = val[1]
            if image_size != prev_image_size:
                err_msg = """The image sizes in the .vec files differ. These values must be the same. \n The image size of file {0}: {1}\n
                The image size of previous files: {2}""".format(f, image_size, prev_image_size)
                sys.exit(err_msg)
            total_num_images += num_images
    except IOError as e:
        print('An IO error occured while processing the file: {0}'.format(f))
        exception_response(e)
I tried looking through solutions, but can't find a solution that fits this specific problem. Any help will be appreciated.
Thank you!
I figured it out by going to the GitHub page for the tutorial. Apparently, I had to delete any .vec files that had a length of zero.
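If you want to find those empty files programmatically rather than by eye, something like this sketch would do it (the directory name is a placeholder):
import os

vec_dir = 'samples'  # placeholder: wherever your .vec files live
for name in os.listdir(vec_dir):
    path = os.path.join(vec_dir, name)
    if name.endswith('.vec') and os.path.getsize(path) == 0:
        print('zero-length file:', path)
        # os.remove(path)  # uncomment to actually delete it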
Your problem is this bit:
content[:12]
The string is not guaranteed to be 12 characters long; it could be fewer. Add a length check and handle it separately, or wrap it in try/except and give the user a saner error message like "Invalid input in file ...".
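A minimal sketch of that check, adapted to the loop from the question (the wording of the message is up to you):
if len(content) < 12:
    sys.exit('Invalid input in file {0}: expected at least 12 bytes, got {1}'.format(f, len(content)))
val = struct.unpack('<iihh', content[:12])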

Python .words issue?

Ok, so I'm trying to create a program that tells me how positive or negative each line of the paulryan.txt file is. I'm using the opinion_lexicon, and the file object is of type '_io.TextIOWrapper'.
Is there something I can use instead of .words?
A less important problem: any ideas how to make my WHOLE paulryan.txt file lowercase while keeping it tokenized by line? I'm thinking it won't give me an accurate positive or negative score if I don't make the whole thing lowercase, because there are only lowercase words in the opinion_lexicon.
import nltk
from nltk.corpus import opinion_lexicon
from nltk.tokenize.simple import (LineTokenizer, line_tokenize)

poswords = set(opinion_lexicon.words("positive-words.txt"))
negwords = set(opinion_lexicon.words("negative-words.txt"))

f = open("paulryan.txt", "rU")
raw = f.read()
token = nltk.line_tokenize(raw)
print(token)

def finddemons():
    for x in token:
        y = token.words()
        percpos = len([w for w in token if w in poswords]) / len(y)
        percneg = len([w for w in token if w in negwords]) / len(y)
        print(x, "pos:", round(percpos, 3), "neg:", round(percneg, 3))

finddemons()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in finddemons
AttributeError: 'list' object has no attribute 'words'
I suggest you read the file line by line. Then use word_tokenize:
for line in f:
    tokens = word_tokenize(line)
You are right about lowercasing the text for searching in the lexicon:
for line in f:
    tokens = word_tokenize(line.lower())
You could even try lemmatizing the tokens using WordNet, because the opinion lexicon is not that rich in vocabulary, especially if you use tweets, where words often appear in different forms.
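Putting that together, a minimal sketch of the per-line scoring (assuming the opinion_lexicon and punkt data have already been downloaded with nltk.download()):
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize

poswords = set(opinion_lexicon.words("positive-words.txt"))
negwords = set(opinion_lexicon.words("negative-words.txt"))

with open("paulryan.txt") as f:
    for line in f:
        tokens = word_tokenize(line.lower())
        if not tokens:
            continue  # skip blank lines to avoid dividing by zero
        percpos = len([w for w in tokens if w in poswords]) / len(tokens)
        percneg = len([w for w in tokens if w in negwords]) / len(tokens)
        print(line.strip(), "pos:", round(percpos, 3), "neg:", round(percneg, 3))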

How to shuffle words in word2vec [duplicate]

This question already has answers here:
Why does random.shuffle return None?
(5 answers)
Closed 5 months ago.
I have this piece of code:
import gensim
import random

file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt')
read_data = file.read()
data = read_data.split('\n')
sentences = [line.split() for line in data]
print(len(sentences))
print(sentences[1])

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5)
model.build_vocab(sentences)

for epoch in range(5):
    shuffled_sentences = random.shuffle(sentences)
    model.train(shuffled_sentences)
    print(epoch)
    print(model)

model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model')
If I print a single sentence, the output is something like this:
['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']
What I need is to shuffle the words before training and then save the model.
I am not sure whether I am coding it the right way. I end up with this exception:
Exception in thread Thread-8:
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer
for sent_idx, sentence in enumerate(sentences):
File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__
for document in self.corpus:
TypeError: 'NoneType' object is not iterable
I would like to ask how I can shuffle the words.
random.shuffle shuffles the list in place and returns None. For this reason, your shuffled sentences are None after this call.
model.build_vocab(sentences)

sentences_list = sentences
Idx = range(len(sentences_list))
print(Idx)

for epoch in range(5):
    random.shuffle(sentences)
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)
    print(epoch)
    print(model)

model.save("somefile.model")
This solves my problem.
But how can I shuffle individual words in a sentence?
Sentence:
['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']
My objective is:
If I check the most similar words for, let's say, 'JO_3787672', then every time it predicts words starting with 'JO_', and the words starting with 'TA_' and 'TI_' have a much lower similarity score.
I suspect this is because of the words' positions in the data (I am not sure). That is why I am trying to shuffle the words within each sentence (I am really not sure whether it helps or not).
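If the goal is to shuffle the words inside every sentence (rather than the order of the sentences), a sketch along these lines would do it; whether it actually improves the similarity scores is untested:
import random

for epoch in range(5):
    for sentence in sentences:
        random.shuffle(sentence)  # reorders the words of this sentence in place
    model.train(sentences)        # train() signature of the gensim version used in the question
    print(epoch)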

Why am I getting an IndexError in Python 3 when indexing a string and not slicing?

I'm new to programming, and experimenting with Python 3. I've found a few topics which deal with IndexError but none that seem to help with this specific circumstance.
I've written a function which opens a text file, reads it one line at a time, and slices the line up into individual strings which are each appended to a particular list (one list per 'column' in the record line). Most of the slices are multiple characters [x:y] but some are single characters [x].
I'm getting an IndexError: string index out of range message, when as far as I can tell, it isn't. This is the function:
def read_recipe_file():
    recipe_id = []
    recipe_book = []
    recipe_name = []
    recipe_page = []
    ingred_1 = []
    ingred_1_qty = []
    ingred_2 = []
    ingred_2_qty = []
    ingred_3 = []
    ingred_3_qty = []
    f = open('recipe-file.txt', 'r')  # open the file
    for line in f:
        # slice out each component of the record line and store it in the appropriate list
        recipe_id.append(line[0:3])
        recipe_name.append(line[3:23])
        recipe_book.append(line[23:43])
        recipe_page.append(line[43:46])
        ingred_1.append(line[46])
        ingred_1_qty.append(line[47:50])
        ingred_2.append(line[50])
        ingred_2_qty.append(line[51:54])
        ingred_3.append(line[54])
        ingred_3_qty.append(line[55:])
    f.close()
    return recipe_id, recipe_name, recipe_book, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, ingred_3, \
        ingred_3_qty
This is the traceback:
Traceback (most recent call last):
File "recipe-test.py", line 84, in <module>
recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, ingred_3, ingred_3_qty = read_recipe_file()
File "recipe-test.py", line 27, in read_recipe_file
ingred_1.append(line[46])
The code which calls the function in question is:
print('To show list of recipes: 1')
print('To add a recipe: 2')
user_choice = input()

recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty, ingred_2, ingred_2_qty, \
    ingred_3, ingred_3_qty = read_recipe_file()

if int(user_choice) == 1:
    print_recipe_table(recipe_id, recipe_book, recipe_name, recipe_page, ingred_1, ingred_1_qty,
                       ingred_2, ingred_2_qty, ingred_3, ingred_3_qty)
elif int(user_choice) == 2:
    # code to add recipe
The failing line is this:
ingred_1.append(line[46])
There are more than 46 characters in each line of the text file I am trying to read, so I don't understand why I'm getting an out-of-bounds error (a sample line is below). If I change the code to this:
ingred_1.append(line[46:])
to read a slice, rather than a specific character, the line executes correctly, and the program fails on this line instead:
ingred_2.append(line[50])
This leads me to think it is somehow related to appending a single character from the string, rather than a slice of multiple characters.
Here is a sample line from the text file I am reading:
001Cheese on Toast Meals For Two 012120038005002
I should probably add that I'm well aware this isn't great code overall - there are lots of ways I could generally improve the program, but as far as I can tell the code should actually work.
This will happen if some of the lines in the file are empty, or at least shorter than expected. A stray newline at the end of the file is a common cause, since it comes up as an extra blank line. The best way to debug a case like this is to catch the exception and investigate the particular line that fails (which almost certainly won't be the sample line you reproduced):
try:
    ingred_1.append(line[46])
except IndexError:
    print(line)
    print(len(line))
Catching this exception is also usually the right way to deal with the error: you've detected a pathological case, and now you can consider what to do. You might for example:
continue, which will silently skip processing that line,
log something and then continue, or
bail out by raising a new, more topical exception, e.g. raise ValueError("Line too short").
Printing something relevant, with or without continuing, is almost always a good idea if this represents a problem with the input file that warrants fixing. Continuing silently is a good option if it is something relatively trivial that you know can't cause flow-on errors in the rest of your processing. You may want to differentiate between the "too short" and "completely empty" cases by detecting the "completely empty" case early, such as by doing this at the top of your loop:
if not line:
    # Skip blank lines
    continue
And handling the error for the other case appropriately.
The reason changing it to a slice works is that string slices never fail. If both indexes in the slice are outside the string (in the same direction), you will get an empty string, e.g.:
>>> 'abc'[4]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> 'abc'[4:]
''
>>> 'abc'[4:7]
''
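Putting both suggestions together, a sketch of how the loop might guard against such lines (the minimum length of 55 is just what the slices above imply; adapt as needed):
for line in f:
    line = line.rstrip('\n')
    if not line:
        continue  # skip completely blank lines, e.g. a trailing newline at end of file
    if len(line) < 55:
        raise ValueError("Line too short: %r" % line)
    recipe_id.append(line[0:3])
    # ... remaining slices as before ...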
Your code fails on line[46] because line contains fewer than 47 characters. The slice operation line[46:] still works because an out-of-range string slice returns an empty string.
You can verify that the line is too short by replacing
ingred_1.append(line[46])
with
try:
    ingred_1.append(line[46])
except IndexError:
    print('line = "%s", length = %d' % (line, len(line)))

wordcount: reducer python program throws ValueError

I get this error whenever I try running the reducer Python program in my Hadoop system. The mapper program runs perfectly, though. I have given it the same permissions as my mapper program. Is there a syntax error?
Traceback (most recent call last):
File "reducer.py", line 13, in
word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack
#!/usr/bin/env python
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue
    try:
        word2count[word] = word2count[word] + count
    except:
        word2count[word] = count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count.keys():
    print '%s\t%s' % (word, word2count[word])
The error ValueError: need more than 1 value to unpack is thrown when you do a multi-assign with too few values on the right hand side. So it looks like line has no \t in it, so line.split('\t',1) results in a single value, causing something like word, count = ("foo",).
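A defensive way to handle this is to check the result of the split before unpacking, e.g. this sketch, which simply skips malformed lines:
parts = line.split('\t', 1)
if len(parts) != 2:
    continue  # no tab separator on this line; skip it instead of crashing
word, count = parts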
I cannot answer in detail.
However, I solved the same issue when I removed some extra print statements I had added in the mapper. It is probably related to how print works with sys.stdin.
I know you have probably already solved the issue by now.
I changed line.split('\t', 1) to line.split(' ', 1) and it worked.
In case the space is not clear, to be perfectly explicit: it should be line.split('(one space here)', 1).
