NLTK stopword removal issue - python

I'm trying to do a document classification, as described in NLTK Chapter 6, and I'm having trouble removing stopwords. When I add
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))
it returns
Traceback (most recent call last):
File "fiction.py", line 8, in <module>
word_features = all_words.keys()[:100]
AttributeError: 'generator' object has no attribute 'keys'
I'm guessing that the stopword code changed the type of object stored in all_words, rendering the .keys() method useless. How can I remove stopwords before calling keys() without changing the object's type? Full code below:
import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_root = './nltk_data/corpora/fiction'
fiction = PlaintextCorpusReader(corpus_root, '.*')
all_words = nltk.FreqDist(w.lower() for w in fiction.words())
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))
word_features = all_words.keys()[:100]

def document_features(document): # [_document-classify-extractor]
    document_words = set(document) # [_document-classify-set]
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print document_features(fiction.words('fic/11.txt'))

I would do this by avoiding adding them to the FreqDist instance in the first place:
all_words=nltk.FreqDist(w.lower() for w in fiction.words() if w.lower() not in nltk.corpus.stopwords.words('english'))
Depending on the size of your corpus I think you'd probably get a performance boost out of creating a set for the stopwords before doing that:
stopword_set = frozenset(nltk.corpus.stopwords.words('english'))
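For example, a minimal sketch of the same FreqDist construction filtered against that set (same corpus reader as above; the variable names are only illustrative):

import nltk
from nltk.corpus import PlaintextCorpusReader

fiction = PlaintextCorpusReader('./nltk_data/corpora/fiction', '.*')
# build the stopword set once so each membership test is O(1) instead of rescanning a list
stopword_set = frozenset(nltk.corpus.stopwords.words('english'))
all_words = nltk.FreqDist(w.lower() for w in fiction.words() if w.lower() not in stopword_set)
# top 100 words by frequency; on old NLTK this is what all_words.keys()[:100] returned
word_features = [w for w, count in all_words.most_common(100)]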
If that's not suitable for your situation, it looks like you can take advantage of the fact that FreqDist inherits from dict:
for stopword in nltk.corpus.stopwords.words('english'):
    if stopword in all_words:
        del all_words[stopword]

Related

AttributeError when building list comprehension for Wordnet.Synsets().Definition()

First off, I'm a Python noob and I only half-understand how some of this stuff works. I've been trying to build word matrices for a tagging project and I hoped I could figure this out on my own, but I'm not seeing a lot of documentation around my particular error. So I apologize up front if this is something super-obvious.
I've tried to get a set of functions to work in a few different variations, but I keep getting "AttributeError: 'list' has no attribute definition."
import pandas as pd
from pandas import DataFrame, Series
import nltk.data
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import TreebankWordTokenizer

# Gets synsets for a given term.
def get_synset(word):
    for word in wn.synsets(word):
        return word.name()

# Gets definitions for a synset.
def get_def(syn):
    return wn.synsets(syn).defnition()

# Creates a dataframe called sector_matrix based on another dataframe's column. Should be followed with an export.
def sector_tagger(frame):
    sentences = frame.tolist()
    tok_list = [tok.tokenize(w) for w in frame]
    split_words = [w.lower() for sub in tok_list for w in sub]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    sector_matrix = DataFrame({'Categories': clean_words,
                               'Synsets': synset})
    sec_syn = sector_matrix['Synsets'].tolist()
    sector_matrix['Definition'] = [get_def(w) for w in sector_matrix['Synsets']]
    return sector_matrix
The functions get called on a dataframe that I read in from excel:
test = pd.read_excel('data.xlsx')
And the sector_tagger function is called as such:
agri_matrix = sector_tagger(agri['Category'])
A previous version called wn.synsets(w).definition() in a list comprehension that populated the DataFrame. Another tried to call the definition after the fact in a Jupyter notebook. I almost always get the AttributeError. That said, when I check the dtype of sector_matrix['Synsets'] I get "object", and when I print that column I don't see [] around the items.
I've tried:
- Wrapping "w" in str()
- Calling the list comprehension in and out of the function (i.e. deleting the line and calling it in my notebook)
- Passing the 'Synsets' column to a new list and building a list comprehension around that
Curiously enough, I was playing around with this yesterday and was able to make something work in my notebook directly, but (a) it's messy (b) there's no scalability, and (c) it doesn't work on other categories that I apply it to.
agrimask = (df['Agri-Food']==1) & (df['Total']==1)
df_agri = df.loc[agrimask,['Category']]
agri_words = [tok.tokenize(a) for a in df_agri['Category']]
agri_cip_words = [a.lower() for sub in agri_words for a in sub]
agri_clean = [w for w in agri_cip_words if w not in english_stops]
df_agri_clean = DataFrame({'Category': agri_clean})
df_agri_clean = df_agri_clean[df_agri_clean != ','].replace('horticulture/horticultural','horticulture').dropna().drop_duplicates()
df_agri_clean['Synsets'] = [x[0].name() for x in df_agri_clean['Category'].apply(syn)]
df_agri_clean['Definition'] = [wn.synset(x).definition() for x in df_agri_clean['Synsets']]
df_agri_clean['Lemma'] = [wn.synset(x).lemmas()[0].name() for x in df_agri_clean['Synsets']]
df_agri_clean
Edit1: Here's a link to a sample of the data.
Edit2: Also, the static variables I'm using are here (all based around the standard NLTK library):
tok = TreebankWordTokenizer()
english_stops = set(stopwords.words('english'))
french_stops = set(stopwords.words('french'))
Edit3: You can see a working version of this code here: Working Code
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import TreebankWordTokenizer as tok

english_stops = set(stopwords.words('english'))

# Gets synsets for a given term.
def get_synset(word):
    for word in wn.synsets(word):
        return word.name()

# Gets definitions for a synset.
def get_def(syn):
    return wn.synset(syn).definition()  # 'definition' was misspelled in the question

# Creates a dataframe called sector_matrix based on another dataframe's column. Should be followed with an export.
def sector_tagger(frame):
    tok_list = tok().tokenize(frame)
    split_words = [w.lower() for w in tok_list]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    sector_matrix = pd.DataFrame({'Categories': clean_words,
                                  'Synsets': synset})
    sec_syn = list(sector_matrix['Synsets'])
    sector_matrix['Definition'] = [get_def(w) if w is not None else '' for w in sec_syn]
    return sector_matrix
agri_matrix = df['Category'].apply(sector_tagger)
If this answers your question, please mark it as the answer.
The output of get_def is a list of phrases
Alternate Approach
def sector_tagger(frame):
    mapping = [('/', ' '), ('(', ''), (')', ''), (',', '')]
    for k, v in mapping:
        frame = frame.replace(k, v)
    tok_list = tok().tokenize(frame)  # note the () after tok
    split_words = [w.lower() for w in tok_list]
    clean_words = [w for w in split_words if w not in english_stops]
    synset = [get_synset(w) for w in clean_words]
    def_matrix = [get_def(w) if w is not None else '' for w in synset]
    return clean_words, synset, def_matrix
poo = df['Category'].apply(sector_tagger)
poo[0] =
(['agricultural', 'domestic', 'animal', 'services'],
['agricultural.a.01', 'domestic.n.01', 'animal.n.01', 'services.n.01'],
['relating to or used in or promoting agriculture or farming',
'a servant who is paid to perform menial tasks around the household',
'a living organism characterized by voluntary movement',
'performance of duties or provision of space and equipment helpful to others'])
list_clean_words = []
list_synset = []
list_def_matrix = []
for x in poo:
list_clean_words.append(x[0])
list_synset.append(x[1])
list_def_matrix.append(x[2])
agri_matrix = pd.DataFrame()
agri_matrix['Categories'] = list_clean_words
agri_matrix['Synsets'] = list_synset
agri_matrix['Definition'] = list_def_matrix
agri_matrix
Categories Synsets Definition
0 [agricultural, domestic, animal, services] [agricultural.a.01, domestic.n.01, animal.n.01... [relating to or used in or promoting agricultu...
1 [agricultural, food, products, processing] [agricultural.a.01, food.n.01, merchandise.n.0... [relating to or used in or promoting agricultu...
2 [agricultural, business, management] [agricultural.a.01, business.n.01, management.... [relating to or used in or promoting agricultu...
3 [agricultural, mechanization] [agricultural.a.01, mechanization.n.01] [relating to or used in or promoting agricultu...
4 [agricultural, production, operations] [agricultural.a.01, production.n.01, operation... [relating to or used in or promoting agricultu...
Split each list of lists into a long list (they're ordered)
def create_long_list_from_list_of_lists(list_of_lists):
    long_list = []
    for one_list in list_of_lists:
        for word in one_list:
            long_list.append(word)
    return long_list
long_list_clean_words = create_long_list_from_list_of_lists(list_clean_words)
long_list_synset = create_long_list_from_list_of_lists(list_synset)
long_list_def_matrix = create_long_list_from_list_of_lists(list_def_matrix)
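The same flattening can also be done with itertools.chain, which is a bit more idiomatic; a small sketch using the same three lists as above:

from itertools import chain

# flatten each list of per-row lists into one long, order-preserving list
long_list_clean_words = list(chain.from_iterable(list_clean_words))
long_list_synset = list(chain.from_iterable(list_synset))
long_list_def_matrix = list(chain.from_iterable(list_def_matrix))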
Turn it into a DataFrame of unique categories
agri_df = pd.DataFrame.from_dict(dict([('Categories', long_list_clean_words), ('Synsets', long_list_synset), ('Definitions', long_list_def_matrix)])).drop_duplicates().reset_index(drop=True)
agri_df.head(4)
Categories Synsets Definitions
0 ceramic ceramic.n.01 an artifact made of hard brittle material prod...
1 horticultural horticultural.a.01 of or relating to the cultivation of plants
2 construction construction.n.01 the act of constructing something
3 building building.n.01 a structure that has a roof and walls and stan...
Final Note
from nltk.tokenize import TreebankWordTokenizer as tok
or:
from nltk.tokenize import word_tokenize
to use:
tok().tokenize(string_text_phrase) # text is a string phrase, not a list of words
or:
word_tokenize(string_text_phrase)
Both methods appear to produce the same output, which is a list of words.
input = "Agricultural and domestic animal services"
output_of_both_methods = ['Agricultural', 'and', 'domestic', 'animal', 'services']

Add and remove words from the NLTK stopwords list

I'm trying to add and remove words from the NLTK stopwords list:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('french'))
#add words that aren't in the NLTK stopwords list
new_stopwords = ['cette', 'les', 'cet']
new_stopwords_list = set(stop_words.extend(new_stopwords))
#remove words that are in NLTK stopwords list
not_stopwords = {'n', 'pas', 'ne'}
final_stop_words = set([word for word in new_stopwords_list if word not in not_stopwords])
print(final_stop_words)
Output:
Traceback (most recent call last):
File "test_stop.py", line 10, in <module>
new_stopwords_list = set(stop_words.extend(new_stopwords))
AttributeError: 'set' object has no attribute 'extend'
Try this:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('french'))
#add words that aren't in the NLTK stopwords list
new_stopwords = ['cette', 'les', 'cet']
new_stopwords_list = stop_words.union(new_stopwords)
#remove words that are in NLTK stopwords list
not_stopwords = {'n', 'pas', 'ne'}
final_stop_words = set([word for word in new_stopwords_list if word not in not_stopwords])
print(final_stop_words)
Alternatively, you can use update instead of extend and replace the line new_stopwords_list = set(stop_words.extend(new_stopwords)) with:
stop_words.update(new_stopwords)
new_stopwords_list = set(stop_words)
By the way, it can be confusing to give a set a name that contains the word "list".
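Since stop_words is already a set, both steps can also be written directly with set operations; a minimal sketch reusing the words from the question:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('french'))
# add custom stopwords
stop_words.update(['cette', 'les', 'cet'])
# drop the stopwords you still want to keep in your texts
final_stop_words = stop_words - {'n', 'pas', 'ne'}
print(final_stop_words)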
Do list(set(...)) instead of set(...), because only lists have a method called extend:
...
stop_words = list(set(stopwords.words('french')))
...
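A minimal sketch of that variant, reusing the question's variables (note that you still end up converting back to a set, or filtering the list directly):

from nltk.corpus import stopwords

stop_words = list(set(stopwords.words('french')))
new_stopwords = ['cette', 'les', 'cet']
stop_words.extend(new_stopwords)

not_stopwords = {'n', 'pas', 'ne'}
final_stop_words = set(w for w in stop_words if w not in not_stopwords)
print(final_stop_words)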

Classification using movie review corpus in NLTK/Python

I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from the first part -- category creation based upon directory names. Some other questions on here have used filenames (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into.
from nltk.corpus import movie_reviews
reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
reviews.categories()
['pos', 'neg']
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = nltk.FreqDist(
    w.lower()
    for w in movie_reviews.words()
    if w.lower() not in nltk.corpus.stopwords.words('english') and w.lower() not in string.punctuation)

word_features = all_words.keys()[:100]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print document_features(movie_reviews.words('pos/11.txt'))

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
This returns:
File "test.py", line 38, in <module>
for w in movie_reviews.words()
File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 184, in words
self, self._resolve(fileids, categories))
File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 91, in words
in self.abspaths(fileids, True, True)])
File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/util.py", line 421, in concat
raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!
---------UPDATE-------------
Thanks alvas for your detailed answer! I have two questions, however.
Is it possible to grab the category from the filename as I was attempting to do? I was hoping to do it in the same vein as the review_pos.txt method, only grabbing the pos from the folder name rather than the file name.
I ran your code and am experiencing a syntax error on
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
with the caret under the first for. I'm a beginner Python user and I'm not familiar enough with that bit of syntax to try to troubleshoot it.
----UPDATE 2----
Error is
File "review.py", line 17
for i in word_features}, tag)
^
SyntaxError: invalid syntax`
Yes, the tutorial in chapter 6 is aimed at giving students a basic foundation, and from there students should build on it by exploring what's available in NLTK and what's not. So let's go through the problems one at a time.
Firstly, the way to get 'pos' / 'neg' documents through the directory is most probably the right thing to do, since the corpus was organized that way.
from nltk.corpus import movie_reviews as mr
from collections import defaultdict

documents = defaultdict(list)
for i in mr.fileids():
    documents[i.split('/')[0]].append(i)

print documents['pos'][:10] # first ten pos reviews.
print
print documents['neg'][:10] # first ten neg reviews.
[out]:
['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt', 'pos/cv009_29592.txt']
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt']
Alternatively, I like a list of tuples where the first element is the list of words in the .txt file and the second is the category. And while doing so, also remove the stopwords and punctuation:
from nltk.corpus import movie_reviews as mr
import string
from nltk.corpus import stopwords
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
Next is the error at FreqDist(for w in movie_reviews.words() ...). There is nothing wrong with your code, just that you should try to use namespaces (see http://en.wikipedia.org/wiki/Namespace#Use_in_common_languages). The following code:
from nltk.corpus import movie_reviews as mr
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string
stop = stopwords.words('english')
all_words = FreqDist(w.lower() for w in mr.words() if w.lower() not in stop and w.lower() not in string.punctuation)
print all_words
[outputs]:
<FreqDist: 'film': 9517, 'one': 5852, 'movie': 5771, 'like': 3690, 'even': 2565, 'good': 2411, 'time': 2411, 'story': 2169, 'would': 2109, 'much': 2049, ...>
Since the above code prints the FreqDist correctly, the error seems like you do not have the files in nltk_data/ directory.
The fact that you have fic/11.txt suggests that you're using some older version of NLTK or the NLTK corpora. Normally the fileids in movie_reviews start with either pos or neg, then a slash, then the filename, and finally .txt, e.g. pos/cv001_18431.txt.
So I think, maybe you should redownload the files with:
$ python
>>> import nltk
>>> nltk.download()
Then make sure that the movie review corpus is properly downloaded under the corpora tab:
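Or non-interactively, if you prefer not to use the downloader GUI (these are the standard NLTK resource names):

import nltk
nltk.download('movie_reviews')  # the categorized movie review corpus
nltk.download('stopwords')      # the stopword lists used above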
Back to the code: looping through all the words in the movie review corpus seems redundant if you already have all the words filtered in your documents, so I would rather do this to extract the full feature set:
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
Next, splitting the train/test sets by features is okay, but I think it's better to split by documents, so instead of this:
featuresets = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
I would recommend this instead:
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
Then feed the data into the classifier and voila! So here's the code without the comments and walkthrough:
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
[out]:
0.655
Most Informative Features
bad = True neg : pos = 2.0 : 1.0
script = True neg : pos = 1.5 : 1.0
world = True pos : neg = 1.5 : 1.0
nothing = True neg : pos = 1.5 : 1.0
bad = False pos : neg = 1.5 : 1.0
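One note on the SyntaxError in UPDATE 2: dict comprehensions such as {i: (i in tokens) for i in word_features} require Python 2.7 or newer, and the traceback in the question shows Python 2.6. A sketch of an equivalent that also runs on 2.6, assuming the same documents, word_features and numtrain as above:

train_set = [(dict((i, i in tokens) for i in word_features), tag) for tokens, tag in documents[:numtrain]]
test_set = [(dict((i, i in tokens) for i in word_features), tag) for tokens, tag in documents[numtrain:]]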

counting n-gram frequency in python nltk

I have the following code. I know that I can use apply_freq_filter function to filter out collocations that are less than a frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case bi-gram) in a document, before I decide what frequency to set for filtering. As you can see I am using the nltk collocations class.
import nltk
from nltk.collocations import *

line = ""
open_file = open('a_text_file','r')
for val in open_file:
    line += val
tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)
print finder.nbest(bigram_measures.pmi, 100)
NLTK comes with its own bigrams generator, as well as a convenient FreqDist() function.
f = open('a_text_file')
raw = f.read()
tokens = nltk.word_tokenize(raw)

# Create your bigrams
bgs = nltk.bigrams(tokens)

# Compute the frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
for k,v in fdist.items():
    print k,v
Once you have access to the BiGrams and the frequency distributions, you can filter according to your needs.
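For instance, a minimal sketch of keeping only the bigrams at or above a frequency threshold (the threshold of 3 simply mirrors the apply_freq_filter(3) call in the question):

min_count = 3
# bigrams seen at least min_count times, most frequent first
frequent_bgs = [(bg, count) for bg, count in fdist.items() if count >= min_count]
frequent_bgs.sort(key=lambda pair: pair[1], reverse=True)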
Hope that helps.
The finder.ngram_fd.viewitems() method also works; ngram_fd is the finder's FreqDist of bigram counts.
I tried all the above and found a simpler solution. NLTK's FreqDist comes with a most_common() method that works directly on n-grams. Here filtered_sentence is my list of word tokens:
import nltk
from nltk.util import ngrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
word_fd = nltk.FreqDist(filtered_sentence)
bigram_fd = nltk.FreqDist(nltk.bigrams(filtered_sentence))
bigram_fd.most_common()
This should give the output as:
[(('working', 'hours'), 31),
(('9', 'hours'), 14),
(('place', 'work'), 13),
(('reduce', 'working'), 11),
(('improve', 'experience'), 9)]
from nltk import FreqDist
from nltk.util import ngrams

def compute_freq():
    textfile = open('corpus.txt','r')
    bigramfdist = FreqDist()
    threeramfdist = FreqDist()
    for line in textfile:
        if len(line) > 1:
            tokens = line.strip().split(' ')
            bigrams = ngrams(tokens, 2)
            bigramfdist.update(bigrams)

compute_freq()
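As written, compute_freq builds the FreqDist but never exposes it. A minimal variant that returns it so the counts can actually be inspected (same file name and the same FreqDist/ngrams calls as above):

from nltk import FreqDist
from nltk.util import ngrams

def compute_freq(path='corpus.txt'):
    bigramfdist = FreqDist()
    with open(path, 'r') as textfile:
        for line in textfile:
            if len(line) > 1:
                bigramfdist.update(ngrams(line.strip().split(' '), 2))
    return bigramfdist

fdist = compute_freq()
print(fdist.most_common(20))  # the 20 most frequent bigrams with their counts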

n-grams with Naive Bayes classifier

I'm new to Python and need help!
I was practicing with Python NLTK text classification.
Here is the code example I am practicing on:
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/
I've tried this one:
from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict

train_samples = {}

with file('positive.txt', 'rt') as f:
    for line in f.readlines():
        train_samples[line] = 'pos'

with file('negative.txt', 'rt') as d:
    for line in d.readlines():
        train_samples[line] = 'neg'

f = open("test.txt", "r")
test_samples = f.readlines()

def bigramReturner(text):
    tweetString = text.lower()
    bigramFeatureVector = {}
    for item in bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector

def get_labeled_features(samples):
    word_freqs = {}
    for text, label in train_samples.items():
        tokens = text.split()
        for token in tokens:
            if token not in word_freqs:
                word_freqs[token] = {'pos': 0, 'neg': 0}
            word_freqs[token][label] += 1
    return word_freqs

def get_label_probdist(labeled_features):
    label_fd = FreqDist()
    for item, counts in labeled_features.items():
        for label in ['neg', 'pos']:
            if counts[label] > 0:
                label_fd.inc(label)
    label_probdist = ELEProbDist(label_fd)
    return label_probdist

def get_feature_probdist(labeled_features):
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    num_samples = len(train_samples) / 2
    for token, counts in labeled_features.items():
        for label in ['neg', 'pos']:
            feature_freqdist[label, token].inc(True, count=counts[label])
            feature_freqdist[label, token].inc(None, num_samples - counts[label])
            feature_values[token].add(None)
            feature_values[token].add(True)
    for item in feature_freqdist.items():
        print item[0], item[1]
    feature_probdist = {}
    for ((label, fname), freqdist) in feature_freqdist.items():
        probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
        feature_probdist[label, fname] = probdist
    return feature_probdist

labeled_features = get_labeled_features(train_samples)
label_probdist = get_label_probdist(labeled_features)
feature_probdist = get_feature_probdist(labeled_features)
classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

for sample in test_samples:
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
but I'm getting this error, why?
Traceback (most recent call last):
File "C:\python\naive_test.py", line 76, in <module>
print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
File "C:\python\naive_test.py", line 23, in bigramReturner
bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'
A bigram feature vector follows the exact same principles as a unigram feature vector. So, just like in the tutorial you mentioned, you will have to check whether a bigram feature is present in any of the documents you will use.
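For example, a minimal sketch of turning one tweet's bigrams into the same kind of presence/absence feature dict the tutorial builds for unigrams (bigram_features here is an assumed list of candidate bigram strings, not something defined in the tutorial):

def bigram_feature_dict(tweet_bigrams, bigram_features):
    # True/False for each candidate bigram, mirroring the unigram extract_features in the tutorial
    present = set(tweet_bigrams)
    return dict((bg, bg in present) for bg in bigram_features)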
As for the bigram features and how to extract them, I have written the code below. You can simply adapt it to change the variable "tweets" in the tutorial.
import nltk

text = "Hi, I want to get the bigram list of this string"
for item in nltk.bigrams(text.split()):
    print ' '.join(item)
Instead of printing them you can simply append them to the "tweets" list and you are good to go! I hope this would be helpful enough. Otherwise, let me know if you still have problems.
Please note that in applications like sentiment analysis some researchers tend to tokenize the words and remove the punctuation, and some others don't. From experience I know that if you don't remove punctuation, Naive Bayes works almost the same, but an SVM would have a decreased accuracy rate. You might need to play around with this stuff and decide what works better on your dataset.
Edit 1:
There is a book named "Natural Language Processing with Python" which I can recommend to you. It contains examples of bigrams as well as some exercises. However, I think you can even solve this case without it. The idea behind selecting bigrams as features is that we want to know the probability that word A appears in our corpus followed by word B. So, for example, in the sentence
"I drive a truck"
the word unigram features would be each of those 4 words while the word bigram features would be:
["I drive", "drive a", "a truck"]
Now you want to use those 3 as your features. So the function below puts all bigrams of a string in a list named bigramFeatureVector.
def bigramReturner(tweetString):
    tweetString = tweetString.lower()
    tweetString = removePunctuation(tweetString)
    bigramFeatureVector = []
    for item in nltk.bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector
Note that you have to write your own removePunctuation function. What you get as output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.
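For illustration, a minimal sketch with a simple regex-based removePunctuation (this particular implementation is just an assumption; use whatever cleaning you prefer):

import re
import nltk

def removePunctuation(text):
    # replace anything that is not a word character or whitespace with a space
    return re.sub(r'[^\w\s]', ' ', text)

def bigramReturner(tweetString):
    tweetString = removePunctuation(tweetString.lower())
    return [' '.join(pair) for pair in nltk.bigrams(tweetString.split())]

print(bigramReturner("I drive a truck!"))
# ['i drive', 'drive a', 'a truck']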
