Loading a classifier using Pickle? - python

I am trying to run a sentiment analysis. I have managed to use Naive Bayes through nltk to classify a corpus of negative and positive tweets. However, I do not want to go through the process of training this classifier every time I run the program, so I tried to use pickle to save the classifier and then load it in a different script. However, when I run that script it returns the error NameError: name 'classifier' is not defined, although I thought it was defined through def load_classifier():
The code I have at the moment is below:
import nltk, pickle
from nltk.corpus import stopwords

customstopwords = ['']

p = open('xxx', 'r')
postxt = p.readlines()
n = open('xxx', 'r')
negtxt = n.readlines()

neglist = []
poslist = []

for i in range(0, len(negtxt)):
    neglist.append('negative')
for i in range(0, len(postxt)):
    poslist.append('positive')

postagged = zip(postxt, poslist)
negtagged = zip(negtxt, neglist)
taggedtweets = postagged + negtagged

tweets = []
for (word, sentiment) in taggedtweets:
    word_filter = [i.lower() for i in word.split()]
    tweets.append((word_filter, sentiment))

def getwords(tweets):
    allwords = []
    for (words, sentiment) in tweets:
        allwords.extend(words)
    return allwords

def getwordfeatures(listoftweets):
    wordfreq = nltk.FreqDist(listoftweets)
    words = wordfreq.keys()
    return words

wordlist = [i for i in getwordfeatures(getwords(tweets)) if not i in stopwords.words('english')]
wordlist = [i for i in getwordfeatures(getwords(tweets)) if not i in customstopwords]

def feature_extractor(doc):
    docwords = set(doc)
    features = {}
    for i in wordlist:
        features['contains(%s)' % i] = (i in docwords)
    return features

training_set = nltk.classify.apply_features(feature_extractor, tweets)

def load_classifier():
    f = open('my_classifier.pickle', 'rb')
    classifier = pickle.load(f)
    f.close
    return classifier

while True:
    input = raw_input('I hate this film')
    if input == 'exit':
        break
    elif input == 'informfeatures':
        print classifier.show_most_informative_features(n=30)
        continue
    else:
        input = input.lower()
        input = input.split()
        print '\nSentiment is ' + classifier.classify(feature_extractor(input)) + ' in that sentence.\n'

p.close()
n.close()
Any help would be great; the script seems to make it to the print '\nSentiment is ' + classifier.classify(feature_extractor(input)) + ' in that sentence.\n' line before returning the error...

Well, you have declared and defined the load_classifier() method but never called it or assigned a variable using it. That means that, by the time execution reaches the print '\nSentiment is... ' line, there is no variable named classifier, so the execution throws an exception.
Add the line classifier = load_classifier() just before the while loop (without any indentation).
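Putting that together, a minimal sketch of the loading side (assuming my_classifier.pickle was written by the training script with pickle.dump):
import pickle

def load_classifier():
    # open the pickled classifier and return the object stored in it
    with open('my_classifier.pickle', 'rb') as f:
        return pickle.load(f)

# call it once, at module level, before the while loop
classifier = load_classifier()
On the saving side, the training script would open the file in 'wb' mode and call pickle.dump(classifier, f) after training.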

Related

only get the last result of the xml file

Here is my code. I tried to use the print function to check it, and I have tagged what I found next to the code using #:
def file(entry):
    file_name = str(entry)
    if file_name.endswith('.xml'):
        tree = ET.parse(file_name)
        root = tree.getroot()
        for i in range(len(root)):
            in_text = str(root[i][5].text).lower()
            print(in_text)  # here I still get all data
    elif file_name.endswith('.json'):
        with open(file_name) as f:
            j_text = json.load(f)
        in_text = (j_text['text']).lower()
    else:
        root_error = tk.Tk()
        root_error.title('Error !')
        canvas_error = tk.Canvas(root_error, height=10, width=100)
        canvas_error.pack()
        label_error = tk.Label(root_error, text='file type dont support')
        label_error.pack()
        root_error.mainloop()
    remove_digits = str.maketrans('', '', digits)
    res = in_text.translate(remove_digits)
    print(res)  # here I get only the last one
    token_text = sent_tokenize(res)
    sent_string = ('\n'.join(token_text))
    removed_pun = str(sent_string).translate(str.maketrans('', '', string.punctuation))
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(str(removed_pun))
    result = [i for i in tokens if not i in stop_words]
    porter = PorterStemmer()
    stemmed = [porter.stem(word) for word in result]
    lemmatizer = WordNetLemmatizer()
    final_text = ' '.join([lemmatizer.lemmatize(w) for w in stemmed])
    lower_label_out['text'] = final_text
But when I use the code on its own like this:
tree = ET.parse('books.xml')
root = tree.getroot()
for i in range(len(root)):
    print(root[i][5].text)
I get all the data. I don't know why the first version only gives me the last item; how can I fix it?
As written in the comment, your problem is that you overwrite the label['text'] value in each iteration. With the new indentation, you just shifted the problem from the out_text variable to the label['text'] variable. If you want to get a list of all out_texts, I'd suggest you do the following.
out_text = []
for i in range(len(root)):
    # in each iteration, append the new string to the list
    out_text.append(str(root[i][0].text))
label_out['text'] = out_text
In each iteration, the value of str(root[i][0].text) is appended to the list, which is finally assigned to label_out['text'].
However, I'd suggest you look into how for loops work in Python, as you could write the same statement as follows:
out_text = []
for ro in root:
    out_text.append(str(ro[0].text))
label_out['text'] = out_text
The reason why the print() statement works is that you put it into the for loop, so each time the code passes there, the current value is printed to the screen.
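If label_out is a Tkinter label as in the question, assigning a Python list to its text may not display the way you expect; a small variation (a sketch, keeping the answer's root[i][0] indexing) joins the collected strings into one string first:
out_text = []
for ro in root:
    out_text.append(str(ro[0].text))
label_out['text'] = '\n'.join(out_text)  # one extracted text per line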
The last line in your for loop is mis-indented, so it only prints the last element.
Try changing it to:
for i in range(len(root)):
    out_text = str(root[i][0].text)
    label_out['text'] = out_text  # note the new indentation
and see if it works.

Trouble loading the GloVe 840B 300d vectors

It seems the format is, for every line, a string like 'word number number ...', so it should be easy to split.
But when I split the lines with the script below
import numpy as np

def loadGloveModel(gloveFile):
    print "Loading Glove Model"
    f = open(gloveFile, 'r')
    model = {}
    for line in f:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print "Done.", len(model), " words loaded!"
    return model
and load glove.840B.300d.txt, I get an error. When I print splitLine I get entries like
['contact', 'name#domain.com', '0.016426', '0.13728', '0.18781', '0.75784', '0.44012', '0.096794' ... ]
or
['.', '.', '.', '.', '0.033459', '-0.085658', '0.27155', ...]
Please notice that this script works fine with glove.6B.*.
The code works fine for files like glove.6B.*d.txt and glove.42B.300d.txt, but not for glove.840B.300d.txt. This is because glove.840B.300d.txt contains words with spaces in them. For example, it has a word like '. . .', with spaces between those dots. I solved this problem by changing this line:
splitLine = line.split()
into
splitLine = line.split(' ')
So your code must be like this:
import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    f = open(gloveFile, 'r', encoding='utf8')
    model = {}
    for line in f:
        splitLine = line.split(' ')
        word = splitLine[0]
        embedding = np.asarray(splitLine[1:], dtype='float32')
        model[word] = embedding
    print("Done.", len(model), "words loaded!")
    return model
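A quick sanity check after loading (a sketch, assuming the file name from the question) is to confirm that every parsed vector really has 300 components:
model = loadGloveModel('glove.840B.300d.txt')
# any entry whose vector is not exactly 300 floats indicates a mis-split line
bad = [w for w, v in model.items() if v.shape != (300,)]
print("vectors with unexpected dimension:", len(bad))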
I think the following may help:
import numpy as np

def process_glove_line(line, dim):
    word = None
    embedding = None
    try:
        splitLine = line.split()
        # everything except the last `dim` fields is the word (it may contain spaces)
        word = " ".join(splitLine[:len(splitLine)-dim])
        embedding = np.array([float(val) for val in splitLine[-dim:]])
    except:
        print(line)
    return word, embedding

def load_glove_model(glove_filepath, dim):
    with open(glove_filepath, encoding="utf8") as f:
        content = f.readlines()
    model = {}
    for line in content:
        word, embedding = process_glove_line(line, dim)
        if embedding is not None:
            model[word] = embedding
    return model

model = load_glove_model("glove.840B.300d.txt", 300)

How to find unique words for each text file in a bundle of text files using python?

How can I find only the words that are unique to a text file? If a word is used frequently in other files, then it gets dropped.
Here is a reference: http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html
I need a script which loops through all text files in a folder and outputs the results in JSON format.
My code so far:
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from pprint import pprint as pp
from glob import glob
from nltk import word_tokenize
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import CountVectorizer
import codecs
import jinja2
import json
import os

def get_raw_data():
    texts = []
    for x in range(1, 95):
        file_name = str(x+1) + ".txt"
        with codecs.open(file_name, "rU", "utf-8") as myfile:
            data = myfile.read()
        texts.append(data)
        yield file_name, '\n'.join(texts)

class StemTokenizer(object):
    def __init__(self):
        self.ignore_set = {'footnote'}

    def __call__(self, doc):
        words = []
        for word in word_tokenize(doc):
            word = word.lower()
            w = wn.morphy(word)
            if w and len(w) > 1 and w not in self.ignore_set:
                words.append(w)
        return words

def process_text(counts, vectorizer, text, file_name, index):
    result = {w: counts[index][vectorizer.vocabulary_.get(w)]
              for w in vectorizer.get_feature_names()}
    result = {w: c for w, c in result.iteritems() if c > 4}
    normalizing_factor = max(c for c in result.itervalues())
    result = {w: c / normalizing_factor
              for w, c in result.iteritems()}
    return result

def main():
    data = list(get_raw_data())
    print('Data loaded')
    n = len(data)
    vectorizer = CountVectorizer(stop_words='english', min_df=(n-1) / n, tokenizer=StemTokenizer())
    counts = vectorizer.fit_transform(text for p, text in data).toarray()
    print('Vectorization done.')
    print(counts)
    for x in range(95):
        file_name = str(x+1) + ".txt"
        # print (text)
        for i, (text) in enumerate(data):
            print(file_name)
            # print (text)
            with codecs.open(file_name, "rU", "utf-8") as myfile:
                text = myfile.read()
            result = process_text(counts, vectorizer, text, file_name, i)
            print(result)

if __name__ == '__main__':
    main()
Looks like you've got a bunch of files named 1.txt, 2.txt, ..., 95.txt, and you want to find words that occur in one file only. I'd just gather all the words, counting how many files each one occurs in, and print out the singletons.
from collections import Counter
import re

fileids = [str(n+1) + ".txt" for n in range(95)]
filecounts = Counter()
for fname in fileids:
    with open(fname) as fp:  # Add encoding if really needed
        text = fp.read().lower()
        words = re.split(r"\W+", text)  # Keep letters, drop the rest
        filecounts.update(set(words))

singletons = [word for word in filecounts if filecounts[word] == 1]
print(" ".join(singletons))
Done. You don't need scikit, you don't need the nltk, you don't need a pile of IR algorithms. You can use the list of singletons in an IR algorithm, but that's a different story.
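The question also asks for JSON output per file; a minimal sketch building on the same counting idea (assuming the files really are named 1.txt through 95.txt and that a mapping of filename to its unique words is the desired output):
import json
import re
from collections import Counter

fileids = [str(n + 1) + ".txt" for n in range(95)]

# First pass: for each word, count how many files it appears in.
filecounts = Counter()
file_words = {}
for fname in fileids:
    with open(fname) as fp:
        words = set(re.split(r"\W+", fp.read().lower()))
        words.discard("")
        file_words[fname] = words
        filecounts.update(words)

# Second pass: keep, per file, only the words that occur in that file alone.
unique_per_file = {fname: sorted(w for w in words if filecounts[w] == 1)
                   for fname, words in file_words.items()}

with open("unique_words.json", "w") as out:
    json.dump(unique_per_file, out, indent=2)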
def parseText():
    # oFile: text file to test
    # myWord: word we are looking for
    # Get all lines into list
    aLines = oFile.readlines()
    # Loop over the lines to test if the word is found
    for sLine in aLines:
        # Parse the line (split on spaces), returns list
        aLine = sLine.split()
        # Iterate words and test to see if they match our word
        for sWord in aLine:
            # if it matches, append it to our list
            if sWord == myWord: aWords.append( sWord )

# Open the text file to test
oFile = open( raw_input( 'file to test: ' ), 'r' )
# Create empty list to store all instances of the word that we may find
aWords = []
# Prompt user to know what word to search
myWord = str( raw_input( 'what word to search: ' ) )
# Call function
parseText()
# Check if list has at least one element
if len( aWords ) < 1: print 'Word not found in file'
else: print str( len( aWords ) ) + ' instances of our word found in file'

How to tweak the NLTK Python code in such a way that I train the classifier only once

I have tried performing sentiment analysis on a huge data set of about 10,000 sentences. With the NLTK Python code for training and testing with Naive Bayes, I have to train the classifier each time I need to classify a set of new sentences, and this takes a lot of time. Is there a way I can take the output of the training part and then reuse it for classification, which would save a lot of time? This is the NLTK code that I have used.
import nltk
import re
import csv

#Read the tweets one by one and process it
def processTweet(tweet):
    # process the tweets
    #convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert @username to AT_USER
    tweet = re.sub('@[^\s]+', 'AT_USER', tweet)
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet

def replaceTwoOrMore(s):
    #look for 2 or more repetitions of character and replace with the character itself
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)
#end

#start getStopWordList
def getStopWordList(stopWordListFileName):
    #read the stopwords file and build a list
    stopWords = []
    stopWords.append('AT_USER')
    stopWords.append('url')
    stopWords.append('URL')
    stopWords.append('rt')
    fp = open(stopWordListFileName)
    line = fp.readline()
    while line:
        word = line.strip()
        stopWords.append(word)
        line = fp.readline()
    fp.close()
    return stopWords
#end

#start getfeatureVector
def getFeatureVector(tweet):
    featureVector = []
    #split tweet into words
    words = tweet.split()
    for w in words:
        #replace two or more with two occurrences
        w = replaceTwoOrMore(w)
        #strip punctuation
        w = w.strip('\'"?,.')
        #check if the word starts with an alphabet
        val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w)
        #ignore if it is a stop word
        if(w in stopWords or val is None):
            continue
        else:
            featureVector.append(w.lower())
    return featureVector
#end

def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features

inpTweets = csv.reader(open('sheet3.csv', 'rb'), delimiter=',')
stopWords = getStopWordList('stopwords.txt')
featureList = []

# Get tweet words
tweets = []
for row in inpTweets:
    sentiment = row[0]
    tweet = row[1]
    processedTweet = processTweet(tweet)
    featureVector = getFeatureVector(processedTweet)
    featureList.extend(featureVector)
    tweets.append((featureVector, sentiment));
#end loop

# Remove featureList duplicates
featureList = list(set(featureList))

# Extract feature vector for all tweets in one shot
training_set = nltk.classify.util.apply_features(extract_features, tweets)

NBClassifier = nltk.NaiveBayesClassifier.train(training_set)

ft = open("april2.tsv")
line = ft.readline()

fo = open("dunno.tsv", "w")
fo.seek(0, 0)

while line:
    testTweet = line
    processedTestTweet = processTweet(testTweet)
    line1 = fo.write( NBClassifier.classify(extract_features(getFeatureVector(processedTestTweet))) + "\n");
    line = ft.readline()

fo.close()
ft.close()
If you want to stick with NLTK, try pickle, e.g. the tagger-saving code from https://spaghetti-tagger.googlecode.com/svn/spaghetti.py (see https://docs.python.org/2/library/pickle.html):
#-*- coding: utf8 -*-
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt
from cPickle import dump, load

def loadtagger(taggerfilename):
    infile = open(taggerfilename, 'rb')
    tagger = load(infile); infile.close()
    return tagger

def traintag(corpusname, corpus):
    # Function to save tagger.
    def savetagger(tagfilename, tagger):
        outfile = open(tagfilename, 'wb')
        dump(tagger, outfile, -1); outfile.close()
        return
    # Training UnigramTagger.
    uni_tag = ut(corpus)
    savetagger(corpusname + '_unigram.tagger', uni_tag)
    # Training BigramTagger.
    bi_tag = bt(corpus)
    savetagger(corpusname + '_bigram.tagger', bi_tag)
    print "Tagger trained with", corpusname, "using" + \
          " UnigramTagger and BigramTagger."
    return
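Applied to the Naive Bayes classifier from the question rather than to a tagger, the same pattern is roughly this (a sketch; it assumes NBClassifier has already been trained as in the question's code, and uses the standard pickle module):
import pickle

# After training once:
# NBClassifier = nltk.NaiveBayesClassifier.train(training_set)
with open('nb_classifier.pickle', 'wb') as out:
    pickle.dump(NBClassifier, out)

# In later runs, skip training and just load it back:
with open('nb_classifier.pickle', 'rb') as inp:
    NBClassifier = pickle.load(inp)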
Otherwise, try other machine learning libraries such as sklearn or shogun.
The Naive Bayes classifier module in NLTK is breathtakingly slow because it's a pure Python implementation. For this reason, consider using a different machine learning (ML) library like scikit-learn.
YS-L's tip about cPickle is good for your purposes at the moment, but if you ever have to retrain the classifier, it'd probably be best to switch to a different Naive Bayes implementation.
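For illustration, a minimal scikit-learn sketch (toy data and made-up file names, not the asker's exact pipeline) that trains once and persists the fitted model so later runs only load it:
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; replace with the processed tweets and their labels.
texts = ["great movie", "awful film", "loved it", "hated it"]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Persist the whole pipeline once...
with open("sentiment_model.pickle", "wb") as f:
    pickle.dump(model, f)

# ...and reload it in a later run without retraining.
with open("sentiment_model.pickle", "rb") as f:
    model = pickle.load(f)
print(model.predict(["what a great film"]))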

Using Binary Search for Spelling Check

I am trying to use binary search to check the spelling of words in a file, and to print out the words that are not in the dictionary. But as of now, most of the correctly spelled words are being printed as misspelled (words that cannot be found in the dictionary).
The dictionary file is also a text file and looks like:
abactinally
abaction
abactor
abaculi
abaculus
abacus
abacuses
Abad
abada
Abadan
Abaddon
abaddon
abadejo
abadengo
abadia
Code:
def binSearch(x, nums):
    low = 0
    high = len(nums) - 1
    while low <= high:
        mid = (low + high) // 2
        item = nums[mid]
        if x == item:
            print(nums[mid])
            return mid
        elif x < item:
            high = mid - 1
        else:
            low = mid + 1
    return -1
def main():
    print("This program performs a spell-check in a file")
    print("and prints a report of the possibly misspelled words.\n")
    # get the sequence of words from the file
    fname = input("File to analyze: ")
    text = open(fname, 'r').read()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.split()
    #import dictionary from file
    fname2 = input("File of dictionary: ")
    dic = open(fname2, 'r').read()
    dic = dic.split()
    #perform binary search for misspelled words
    misw = []
    for w in words:
        m = binSearch(w, dic)
        if m == -1:
            misw.append(w)
Your binary search works perfectly! You don't seem to be removing all special characters, though.
Testing your code (with a sentence of my own):
def main():
    print("This program performs a spell-check in a file")
    print("and prints a report of the possibly misspelled words.\n")
    text = 'An old mann gathreed his abacus, and ran a mile. His abacus\n ran two miles!'
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    words = text.lower().split(' ')
    dic = ['a', 'abacus', 'an', 'and', 'arranged', 'gathered', 'his', 'man', 'mile', 'miles', 'old', 'ran', 'two']
    #perform binary search for misspelled words
    misw = []
    for w in words:
        m = binSearch(w, dic)
        if m == -1:
            misw.append(w)
    print(misw)
prints as output ['mann', 'gathreed', '', '', 'abacus\n', '']
Those extra empty strings '' are the extra spaces left over from the punctuation you replaced with spaces. The \n (a line break) is a little more problematic: it is something you definitely see in external text files, but it is not intuitive to account for. Instead of for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':, just check whether every character .isalpha(). Try this:
def main():
    ...
    text = 'An old mann gathreed his abacus, and ran a mile. His abacus\n ran two miles!'
    for ch in text:
        if not ch.isalpha() and not ch == ' ':
            #we want to keep spaces or else we'd only have one word in our entire text
            text = text.replace(ch, '')  #replace with empty string (basically, remove)
    words = text.lower().split(' ')
    #import dictionary
    dic = ['a', 'abacus', 'an', 'and', 'arranged', 'gathered', 'his', 'man', 'mile', 'miles', 'old', 'ran', 'two']
    #perform binary search for misspelled words
    misw = []
    for w in words:
        m = binSearch(w, dic)
        if m == -1:
            misw.append(w)
    print(misw)
Output:
This program performs a spell-check in a file
and prints a report of the possibly misspelled words.
['mann', 'gathreed']
Hope this was helpful! Feel free to comment if you need clarification or something doesn't work.
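One more thing worth checking, since binary search only works on a list sorted in the same order the comparisons use: the dictionary excerpt in the question mixes capitalized and lowercase entries (e.g. Abad between abacuses and abada), which does not match Python's default string ordering, so lookups can miss words that are actually present. A small adjustment (a sketch that slots into the question's main()) is to lowercase and sort the dictionary, and lowercase each word before searching:
#lowercase and sort the dictionary so it matches Python's comparison order
dic = open(fname2, 'r').read().split()
dic = sorted(set(w.lower() for w in dic))

misw = []
for w in words:
    if binSearch(w.lower(), dic) == -1:
        misw.append(w)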
