Counting n-gram frequency in Python NLTK

I have the following code. I know that I can use the apply_freq_filter function to filter out collocations that occur less than a given number of times. However, I don't know how to get the frequencies of all the n-gram tuples (in my case, bigrams) in a document before I decide what frequency threshold to set for filtering. As you can see, I am using the NLTK collocations class.
import nltk
from nltk.collocations import *

line = ""
open_file = open('a_text_file', 'r')
for val in open_file:
    line += val
tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)
print finder.nbest(bigram_measures.pmi, 100)

NLTK comes with its own bigrams generator, as well as a convenient FreqDist() function.
import nltk

f = open('a_text_file')
raw = f.read()
tokens = nltk.word_tokenize(raw)

# Create the bigrams
bgs = nltk.bigrams(tokens)

# Compute the frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
for k, v in fdist.items():
    print k, v
Once you have access to the bigrams and their frequency distribution, you can filter according to your needs, for example as in the sketch below.
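A minimal sketch of that filtering step, assuming Python 3 / NLTK 3 and an illustrative minimum count of 3:

# Keep only the bigrams that occur at least min_count times (illustrative threshold).
min_count = 3
frequent_bigrams = [(bg, count) for bg, count in fdist.most_common() if count >= min_count]
for bg, count in frequent_bigrams:
    print(bg, count)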
Hope that helps.

The finder.ngram_fd.viewitems() function also works: the collocation finder keeps a FreqDist of every bigram it has seen, so you can read the counts from it directly (on Python 3 / NLTK 3, use finder.ngram_fd.items() instead).
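A rough sketch of what that looks like with the finder from the question (Python 3 syntax):

# finder.ngram_fd maps each bigram tuple to its frequency in the text.
for bigram, count in finder.ngram_fd.items():
    print(bigram, count)

# Looking at the most frequent bigrams first helps pick a sensible filter threshold.
print(finder.ngram_fd.most_common(20))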

I tried all of the above and found a simpler solution: NLTK's FreqDist has a most_common() method that gives you the most frequent n-grams directly.
Here filtered_sentence is my list of word tokens:
import nltk
from nltk.util import ngrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
word_fd = nltk.FreqDist(filtered_sentence)
bigram_fd = nltk.FreqDist(nltk.bigrams(filtered_sentence))
bigram_fd.most_common()
This should give output like:
[(('working', 'hours'), 31),
(('9', 'hours'), 14),
(('place', 'work'), 13),
(('reduce', 'working'), 11),
(('improve', 'experience'), 9)]

from nltk import FreqDist
from nltk.util import ngrams

def compute_freq():
    # Build a bigram frequency distribution over corpus.txt, one line at a time.
    bigramfdist = FreqDist()
    with open('corpus.txt', 'r') as textfile:
        for line in textfile:
            if len(line) > 1:
                tokens = line.strip().split(' ')
                bigrams = ngrams(tokens, 2)
                bigramfdist.update(bigrams)
    return bigramfdist

bigram_freqs = compute_freq()
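From there you could, for example, look at the most frequent bigrams to choose a filter threshold (a minimal usage sketch, assuming the return value added above and Python 3):

# The 10 most frequent bigrams in corpus.txt and their counts.
print(bigram_freqs.most_common(10))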

Related

TFIDFVectorizer making concatenated word tokens

I am using the Cranfield Dataset to make an Indexer and Query Processor. For that purpose I am using TfidfVectorizer to tokenize the data. But when I check the vocabulary after using TfidfVectorizer, there are a lot of tokens formed by the concatenation of two words.
I am using the following code to achieve it:
import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
#reading the data
with open('cran.all', 'r') as f:
    content_string = ""
    content = [line.replace('\n','') for line in f]
    content = content_string.join(content)

doc = re.split('.I\s[0-9]{1,4}', content)
f.close()
#some data cleaning
doc = [line.replace('.T',' ').replace('.B',' ').replace('.A',' ').replace('.W',' ') for line in doc]
del doc[0]
doc= [ re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(analyzer ='word', ngram_range=(1,1), stop_words=text.ENGLISH_STOP_WORDS,lowercase=True)
X = vectorizer.fit_transform(doc)
print(vectorizer.vocabulary_)
I have attached below a few examples of what I obtain when I print the vocabulary:
'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752
When I use a small amount of data, this does not happen. How can I prevent this from happening?
This happens when two words are commonly used together. It seems that the concatenated words result from the n-gram generation in the TfidfVectorizer: when you set ngram_range=(1,1), the vectorizer only considers single words, but when you increase the ngram_range, the vectorizer considers n-grams of words.
You can use a regular expression to avoid this:
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
# Remove n-grams that have two words concatenated
pattern = r'\b\w+\w\b'
vectorizer.vocabulary_ = {key: val for key, val in vectorizer.vocabulary_.items() if re.match(pattern, key)}
The pattern \b\w+\w\b matches n-grams that have two words concatenated, such as freevibration. The resulting vocabulary_ dictionary will not contain these n-grams.
My guess would be that the issue is caused by this line:
content = [line.replace('\n','') for line in f]
When replacing line breaks with nothing, the last word of one line is concatenated with the first word of the next line. And of course this happens for every line, so you get a lot of these concatenations. The solution is super simple: instead of replacing the line breaks with nothing (i.e. just removing them), replace them with a space:
content = [line.replace('\n',' ') for line in f]
(note the space between the quotes)
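To see why the space matters, here is a tiny illustration with made-up lines (the strings are just examples, not from the Cranfield data):

lines = ["flow of a\n", "compressible fluid\n"]

# Replacing '\n' with nothing glues the last word of one line to the first word of the next.
print("".join(line.replace('\n', '') for line in lines))   # flow of acompressible fluid

# Replacing '\n' with a space keeps the words separate.
print("".join(line.replace('\n', ' ') for line in lines))  # flow of a compressible fluid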

How to remove an entire line if it does not have a pos tag like CD?

I am reading a news article and POS-tagging it with NLTK. I want to remove those lines that do not have a POS tag like CD (numbers).
import io
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
stop_words = set(stopwords.words('english'))
file1 = open("etorg.txt")
line = file1.read()
file1.close()
print(line)
words = line.split()
tokens = nltk.pos_tag(words)
How do I remove all sentences that do not contain the CD tag?
Just use [word for word in tokens if word[1] != 'CD']
EDIT: To get the sentences that have no numbers, use this code:
def has_number(sentence):
    for i in nltk.pos_tag(sentence.split()):
        if i[1] == 'CD':
            return ''
    return sentence
line = 'MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. '
''.join([has_number(x) for x in line.split('.')])
> ' However, industry sources do not confirm this data '
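A small variation on the same idea, assuming you would rather use NLTK's own sentence tokenizer than splitting on '.' (keep_sentences_without_numbers is a hypothetical helper name, not part of the question's code):

import nltk

def keep_sentences_without_numbers(text):
    # Keep only the sentences in which no token is tagged CD (cardinal number).
    return [s for s in nltk.sent_tokenize(text)
            if not any(tag == 'CD' for _, tag in nltk.pos_tag(nltk.word_tokenize(s)))]

print(keep_sentences_without_numbers(line))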

CSV file with label

As suggested here (Python Tf idf algorithm), I use this code to get the frequency of words over a set of documents.
import pandas as pd
import csv
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
import codecs
def tokenize(text):
    tokens = word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems

with codecs.open("book1.txt",'r','utf-8') as i1,\
     codecs.open("book2.txt",'r','utf-8') as i2,\
     codecs.open("book3.txt",'r','utf-8') as i3:
    # your corpus
    t1 = i1.read().replace('\n',' ')
    t2 = i2.read().replace('\n',' ')
    t3 = i3.read().replace('\n',' ')

text = [t1, t2, t3]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
top_words.to_csv('dict.csv', index=True, float_format="%f", encoding="utf-8")
With the last line, I create a CSV file that lists all the words and their frequencies. Is there a way to add a label to them, to see whether a word belongs only to the third document or to all of them?
My goal is to delete from the CSV file all the words that appear only in the 3rd document (book3).
You can use the isin() method to filter the top_words of the third book out of the top_words of the entire corpus.
(For the example below I downloaded three random books from http://www.gutenberg.org/)
import codecs
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# import nltk
# nltk.download('punkt')
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

def tokenize(text):
    tokens = word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems

with codecs.open("56732-0.txt",'r','utf-8') as i1,\
     codecs.open("56734-0.txt",'r','utf-8') as i2,\
     codecs.open("56736-0.txt",'r','utf-8') as i3:
    # your corpus
    t1 = i1.read().replace('\n',' ')
    t2 = i2.read().replace('\n',' ')
    t3 = i3.read().replace('\n',' ')

text = [t1, t2, t3]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)

# top_words for the 3rd book alone
text = [" ".join(tokenize(t3.lower()))]
matrix = vectorizer.fit_transform(text).todense()
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
top_words3 = matrix.sum(axis=0).sort_values(ascending=False)

# Mask out words in t3
mask = ~top_words.index.isin(top_words3.index)
# Filter those words from top_words
top_words = top_words[mask]
top_words.to_csv('dict.csv', index=True, float_format="%f", encoding="utf-8")

Bigrams and .Join

I got this function and tried to edit it a little for my purpose, but instead of getting bigrams I get unigrams. What do I need to add or edit?
I am really new to Python and NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
import re

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)
    result = [x for x in tokens if x not in stopwords.words('english') and len(x) > 3]
    return result

filename = raw_input('Enter File Name :')
word_list = re.split('\s+', file(filename).read().lower())
f = open('test2.csv', 'w')
for line in word_list:
    features = get_bigrams(line)
    print features
    f.write(str(line))
    f.write("\n")
The output for the example "It has been a long time" is:
It
has
been
a
long
time
Yet I am looking for something like
It has
has been
been a
a long
long time
I think your problem is how you tackle the file reading and the line processing:
The following line gives you a list of words (as the name suggests)
word_list = re.split('\s+', file(filename).read().lower())
but later you treat each single word as a line:
for line in word_list:
This means that your code simply cannot work as intended.
If I understand you correctly you might want to change file reading in the following way:
filename = raw_input('Enter File Name :')
lines = file(filename).readlines()
f = open('test2.csv', 'w')
for line in lines:
    features = get_bigrams(line)
    # do more things
NLTK seems like overkill here. Why not just do:
def pairs(seq):
    return zip(seq, seq[1:])

s = "It has been a long time"
words = s.split()
for bigram in pairs(words):
    print bigram
Result:
('It', 'has')
('has', 'been')
('been', 'a')
('a', 'long')
('long', 'time')
Your function get_bigrams seems to work for me so I think the problem is your file or the way you read it.
By the way, I'd like to suggest a shorter version of get_bigrams:
import nltk

def get_bigrams(sentence):
    tokens = nltk.word_tokenize(sentence)
    return zip(tokens, tokens[1:])
Use:
>>> [' '.join(b) for b in get_bigrams("It has been a long time")]
['It has', 'has been', 'been a', 'a long', 'long time']
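One caveat in case you are on Python 3 rather than Python 2 (an assumption; the answer above uses Python 2 syntax): zip() returns a lazy iterator there, so materialize it if you need a list:

import nltk

def get_bigrams(sentence):
    tokens = nltk.word_tokenize(sentence)
    # list() is needed on Python 3, where zip() returns an iterator.
    return list(zip(tokens, tokens[1:]))

print([' '.join(b) for b in get_bigrams("It has been a long time")])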

NLTK: Naive Bayes - where/how to add in ngrams?

I am doing a classification task on tweets (3 labels = pos, neg, neutral), for which I'm using Naive Bayes in NLTK. I'd like to add in n-grams (bigrams) as well. I have tried adding them to the code, but I can't seem to figure out where to fit them in. At the moment it seems as if I'm "breaking" the code no matter where I add the bigrams. Could anybody please help me out, or redirect me to a tutorial?
My code for unigrams follows. If you need any information on how the datasets look, I'd be happy to provide it.
import nltk
import csv
import random
import nltk.classify.util, nltk.metrics
import codecs
import re, math, collections, itertools
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.util import ngrams
from nltk import bigrams
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
stemmer = SnowballStemmer("english", ignore_stopwords = True)
stopset = set(stopwords.words('english'))
stopset.add('username')
stopset.add('url')
stopset.add('percentage')
stopset.add('number')
stopset.add('at_user')
stopset.add('AT_USER')
stopset.add('URL')
stopset.add('percentagenumber')
inpTweets = []
##with open('sanders.csv', 'r', 'utf-8') as f: #input sanders
## reader = csv.reader(f, delimiter = ';')
## for row in reader:
## inpTweets.append((row))
reader = codecs.open('...sanders.csv', 'r', encoding='utf-8-sig') #input classified tweets
for line in reader:
    line = line.rstrip()
    row = line.split(';')
    inpTweets.append((row))

def processTweet(tweet):
    tweet = tweet.lower()
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','URL',tweet)
    tweet = re.sub('#[^\s]+','AT_USER',tweet)
    tweet = re.sub('[\s]+', ' ', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    tweet = tweet.strip('\'"')
    return tweet

def replaceTwoOrMore(s):
    # look for 2 or more repetitions of character and replace with the character itself
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)

def preprocessing(doc):
    tokens = tokenizer.tokenize(doc)
    bla = []
    for x in tokens:
        if len(x) > 2:
            if x not in stopset:
                val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", x)
                if val is not None:
                    x = replaceTwoOrMore(x)
                    x = processTweet(x)
                    x = x.strip('\'"?,.')
                    x = stemmer.stem(x).lower()
                    bla.append(x)
    return bla
xyz = []
for lijn in inpTweets:
    xyz.append((preprocessing(lijn[0]), lijn[1]))
random.shuffle(xyz)
featureList = []
k = 0
while k in range(0, len(xyz)):
    featureList.extend(xyz[k][0])
    k = k + 1
fd = nltk.FreqDist(featureList)
featureList = list(fd.keys())[2000:]
def document_features(doc):
    features = {}
    document_words = set(doc)
    for word in featureList:
        features['contains(%s)' % word] = (word in document_words)
    return features
featuresets = nltk.classify.util.apply_features(document_features, xyz)
training_set, test_set = featuresets[2000:], featuresets[:2000]
classifier = nltk.NaiveBayesClassifier.train(training_set)
Your code uses the 2000 most common words as the classification features. Just select the bigrams you want to use, and convert them to features in document_features(). A feature like "contains (the dog)" will work just like "contains (dog)".
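A minimal sketch of what that could look like, reusing the xyz, featureList, and document_features names from the question (bigram_featureList and the cutoff of 200 are just illustrative, not part of the original code):

# Collect candidate bigram features from the preprocessed tweets in xyz.
bigram_fd = nltk.FreqDist()
for tokens, label in xyz:
    bigram_fd.update(nltk.bigrams(tokens))

# Keep, say, the 200 most common bigrams as extra features (illustrative cutoff).
bigram_featureList = [bg for bg, count in bigram_fd.most_common(200)]

def document_features(doc):
    features = {}
    document_words = set(doc)
    document_bigrams = set(nltk.bigrams(doc))
    for word in featureList:
        features['contains(%s)' % word] = (word in document_words)
    for bg in bigram_featureList:
        features['contains(%s %s)' % bg] = (bg in document_bigrams)
    return features
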
An interesting approach is to use a sequential backoff tagger, which allows you to chain taggers together: in this way you could train an n-gram tagger and a Naive Bayes classifier and chain them together.
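For reference, a minimal sketch of NLTK's sequential backoff chaining for taggers (this is about POS tagging rather than the tweet classifier above; the treebank corpus is used purely as example training data and needs nltk.download('treebank')):

import nltk
from nltk.corpus import treebank

# Example training data; any list of tagged sentences works here.
train_sents = treebank.tagged_sents()[:3000]

# Each tagger falls back to the next one when it has no answer for a token.
default_tagger = nltk.DefaultTagger('NN')
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)

print(bigram_tagger.tag(['This', 'is', 'a', 'test']))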
