I got this function and tried to edit it a little for my purpose, but instead of getting bigrams I get unigrams.
What do I need to add or change? I am really new to Python and NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
import re
def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)
    result = [x for x in tokens if x not in stopwords.words('english') and len(x) > 3]
    return result
filename = raw_input('Enter File Name :')
word_list = re.split('\s+', file(filename).read().lower())
f = open('test2.csv', 'w')
for line in word_list:
    features = get_bigrams(line)
    print features
    f.write(str(line))
    f.write("\n")
The output for the example "It has been a long time" is:
It
has
been
a
long
time
Yet I am looking for something like
It has
has been
been a
a long
long time
I think your problem is how you tackle the file reading and the line processing:
The following line gives you a list of words (as the name suggests)
word_list = re.split('\s+', file(filename).read().lower())
but later you treat each single word as a line:
for line in word_list:
This means that your code simply cannot work as intended.
If I understand you correctly you might want to change file reading in the following way:
filename = raw_input('Enter File Name :')
lines = file(filename).readlines()
f = open('test2.csv', 'w')
for line in lines:
    features = get_bigrams(line)
    # do more things
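For completeness, a minimal sketch of what "do more things" could look like if you want one space-joined line of features per input line written to test2.csv (the output format here is an assumption, not something stated in the question):
filename = raw_input('Enter File Name :')
lines = file(filename).readlines()
f = open('test2.csv', 'w')
for line in lines:
    features = get_bigrams(line)     # unigrams plus joined bigrams, stopwords removed
    f.write(' '.join(features))      # assumed format: space-joined features, one line per input line
    f.write('\n')
f.close()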
NLTK seems like overkill here. Why not just do:
def pairs(seq):
    return zip(seq, seq[1:])

s = "It has been a long time"
words = s.split()
for bigram in pairs(words):
    print bigram
Result:
('It', 'has')
('has', 'been')
('been', 'a')
('a', 'long')
('long', 'time')
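If you later need trigrams or larger n-grams, the same zip trick generalizes; a small sketch (the ngrams name here is just illustrative, not nltk.util.ngrams):
def ngrams(seq, n):
    # slide a window of size n over seq by zipping shifted copies
    return zip(*[seq[i:] for i in range(n)])

for trigram in ngrams("It has been a long time".split(), 3):
    print trigram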
Your function get_bigrams seems to work for me, so I think the problem is your file or the way you read it.
By the way, I'd like to suggest shorter code for get_bigrams:
import nltk
def get_bigrams(sentence):
    tokens = nltk.word_tokenize(sentence)
    return zip(tokens, tokens[1:])
Use:
>>> [' '.join(b) for b in get_bigrams("It has been a long time")]
['It has', 'has been', 'been a', 'a long', 'long time']
I am using the Cranfield Dataset to build an indexer and query processor. For that purpose I am using TfidfVectorizer to tokenize the data. But after using TfidfVectorizer, when I check the vocabulary, there are a lot of tokens formed by concatenating two words.
I am using the following code to achieve it:
import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
#reading the data
with open('cran.all', 'r') as f:
    content_string = ""
    content = [line.replace('\n', '') for line in f]
    content = content_string.join(content)
    doc = re.split('.I\s[0-9]{1,4}', content)
f.close()
#some data cleaning
doc = [line.replace('.T', ' ').replace('.B', ' ').replace('.A', ' ').replace('.W', ' ') for line in doc]
del doc[0]
doc = [re.sub('[^A-Za-z]+', ' ', lines) for lines in doc]
vectorizer = TfidfVectorizer(analyzer ='word', ngram_range=(1,1), stop_words=text.ENGLISH_STOP_WORDS,lowercase=True)
X = vectorizer.fit_transform(doc)
print(vectorizer.vocabulary_)
I have attached below a few examples I obtain when I print vocabulary:
'freevibration': 7222, 'slendersharp': 15197, 'frequentlyapproximated': 7249, 'notapplicable': 11347, 'rateof': 13727, 'itsvalue': 9443, 'speedflow': 15516, 'movingwith': 11001, 'speedsolution': 15531, 'centerof': 3314, 'hypersoniclow': 8230, 'neice': 11145, 'rutkowski': 14444, 'chann': 3381, 'layerapproximations': 9828, 'probsteinhave': 13353, 'thishypersonic': 17752
When I use a small amount of data, this does not happen. How can I prevent this from happening?
This happens when two words are commonly used together. It seems that the concatenated words result from the n-gram generation in the TfidfVectorizer: when you set ngram_range=(1,1), the vectorizer only considers single words, but when you increase the ngram_range, the vectorizer considers n-grams of words.
You can use a regular expression to avoid this.
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words=text.ENGLISH_STOP_WORDS, lowercase=True)
X = vectorizer.fit_transform(doc)
# Remove n-grams that have two words concatenated
pattern = r'\b\w+\w\b'
vectorizer.vocabulary_ = {key: val for key, val in vectorizer.vocabulary_.items() if re.match(pattern, key)}
The pattern \b\w+\w\b matches n-grams that have two words concatenated, such as freevibration. The resulting vocabulary_ dictionary will not contain these n-grams.
My guess would be that the issue is caused by this line:
content = [line.replace('\n','') for line in f]
When replacing line breaks, the last word of line 1 is concatenated with the first word of line 2. And of course this happens for every line, so you get a lot of these. The solution is super simple: instead of replacing line breaks with nothing (i.e. just removing them), replace them with a space:
content = [line.replace('\n',' ') for line in f]
(note the space between the quotes)
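A tiny illustration of the difference (the two input lines here are made up for demonstration):
lines = ["analysis of free\n", "vibration of plates\n"]
print(''.join(l.replace('\n', '') for l in lines))    # 'analysis of freevibration of plates'
print(''.join(l.replace('\n', ' ') for l in lines))   # 'analysis of free vibration of plates '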
I am reading a news article and POS-tagging it with NLTK. I want to remove the lines that do not have a POS tag like CD (numbers).
import io
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
stop_words = set(stopwords.words('english'))
file1 = open("etorg.txt")
line = file1.read()
file1.close()
print(line)
words = line.split()
tokens = nltk.pos_tag(words)
How do I remove all sentences that do not contain the CD tag?
Just use [word for word in tokens if word[1] != 'CD']
EDIT: To get the sentences that have no numbers, use this code:
def has_number(sentence):
    for i in nltk.pos_tag(sentence.split()):
        if i[1] == 'CD':
            return ''
    return sentence

line = 'MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. '
''.join([has_number(x) for x in line.split('.')])
> ' However, industry sources do not confirm this data '
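If the article's sentences don't always end with a period, a variant of the same idea using nltk.sent_tokenize (rather than splitting on '.') might look like this; the function name is just illustrative, and it assumes the punkt and tagger resources have been downloaded:
import nltk

def sentences_without_numbers(text):
    # keep only sentences whose POS tags contain no CD (cardinal number) tag
    kept = []
    for sentence in nltk.sent_tokenize(text):
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
        if 'CD' not in tags:
            kept.append(sentence)
    return ' '.join(kept)

print(sentences_without_numbers('MNC claims 21 million sales in September. However, industry sources do not confirm this data.'))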
I was given this formula, called FRES (Flesch reading-ease test), that is used to measure the readability of a document:
FRES = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words)
My task is to write a Python function that returns the FRES of a text, so I need to convert this formula into a Python function.
I have re-implemented my code based on an answer I got, to show what I have so far and the result it gives me:
import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
import re
from itertools import chain
from nltk.corpus import gutenberg
VC = re.compile('[aeiou]+[^aeiou]+', re.I)
def count_syllables(word):
    return len(VC.findall(word))

def compute_fres(text):
    """Return the FRES of a text.

    >>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    >>> compute_fres(emma) # doctest: +ELLIPSIS
    99.40...
    """
    for filename in gutenberg.fileids():
        sents = gutenberg.sents(filename)
        words = gutenberg.words(filename)
        num_sents = len(sents)
        num_words = len(words)
        num_syllables = sum(count_syllables(w) for w in words)
        score = 206.835 - 1.015 * (num_words / num_sents) - 84.6 * (num_syllables / num_words)
    return(score)
After running the code this is the result message I got:
Failure
Expected :99.40...
Actual :92.84866041488623
File "C:/Users/PycharmProjects/a1/a1.py", line 60, in a1.compute_fres
Failed example:
compute_fres(emma) # doctest: +ELLIPSIS
Expected:
99.40...
Got:
92.84866041488623
My function is supposed to pass the doctest and return 99.40... I'm also not allowed to edit the count_syllables function, since it came with the task:
import re
VC = re.compile('[aeiou]+[^aeiou]+', re.I)
def count_syllables(word):
    return len(VC.findall(word))
This question has been very tricky, but at least now it's giving me a result instead of an error message; I'm just not sure why it's giving me a different result.
Any help will be very appreciated. Thank you.
BTW, there's the textstat library.
from textstat.textstat import textstat
from nltk.corpus import gutenberg
for filename in gutenberg.fileids():
    print(filename, textstat.flesch_reading_ease(gutenberg.raw(filename)))  # score the raw text, not the file name string
If you're bent on coding up your own, you first have to decide whether a punctuation mark counts as a word, and define how to count the number of syllables in a word.
If punctuation counts as a word and syllables are counted by the regex in your question, then:
import re
from itertools import chain
from nltk.corpus import gutenberg
def num_syllables_per_word(word):
    return len(re.findall('[aeiou]+[^aeiou]+', word))

for filename in gutenberg.fileids():
    sents = gutenberg.sents(filename)
    words = gutenberg.words(filename)  # i.e. list(chain(*sents))
    num_sents = len(sents)
    num_words = len(words)
    num_syllables = sum(num_syllables_per_word(w) for w in words)
    score = 206.835 - 1.015 * (num_words / num_sents) - 84.6 * (num_syllables / num_words)
    print(filename, score)
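If the goal is for compute_fres to score only the text it is given (rather than looping over every Gutenberg file), a minimal sketch in that direction is shown below; it assumes the punkt tokenizer is available, reuses the count_syllables function from the task, and whether it reproduces 99.40... exactly still depends on how sentences and words are tokenized:
import nltk

def compute_fres(text):
    """Return the FRES of the given text (Python 3 true division assumed)."""
    sents = nltk.sent_tokenize(text)                            # sentence segmentation
    words = [w for s in sents for w in nltk.word_tokenize(s)]   # word tokens from those sentences
    num_sents = len(sents)
    num_words = len(words)
    num_syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (num_words / num_sents) - 84.6 * (num_syllables / num_words)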
I am doing a classification task on tweets (3 labels: pos, neg, neutral), for which I'm using Naive Bayes in NLTK. I'd like to add ngrams (bigrams) as well. I have tried adding them to the code, but I can't seem to figure out where to fit them in. At the moment it seems as if I'm "breaking" the code no matter where I add the bigrams. Could anybody please help me out, or redirect me to a tutorial?
My code for unigrams follows. If you need any information on how the datasets look, I'd be happy to provide it.
import nltk
import csv
import random
import nltk.classify.util, nltk.metrics
import codecs
import re, math, collections, itertools
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.util import ngrams
from nltk import bigrams
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
stemmer = SnowballStemmer("english", ignore_stopwords = True)
stopset = set(stopwords.words('english'))
stopset.add('username')
stopset.add('url')
stopset.add('percentage')
stopset.add('number')
stopset.add('at_user')
stopset.add('AT_USER')
stopset.add('URL')
stopset.add('percentagenumber')
inpTweets = []
##with open('sanders.csv', 'r', 'utf-8') as f: #input sanders
## reader = csv.reader(f, delimiter = ';')
## for row in reader:
## inpTweets.append((row))
reader = codecs.open('...sanders.csv', 'r', encoding='utf-8-sig') #input classified tweets
for line in reader:
    line = line.rstrip()
    row = line.split(';')
    inpTweets.append((row))
def processTweet(tweet):
    tweet = tweet.lower()
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    tweet = re.sub('#[^\s]+', 'AT_USER', tweet)
    tweet = re.sub('[\s]+', ' ', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    tweet = tweet.strip('\'"')
    return tweet

def replaceTwoOrMore(s):
    # look for 2 or more repetitions of a character and replace with the character itself
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)
def preprocessing(doc):
    tokens = tokenizer.tokenize(doc)
    bla = []
    for x in tokens:
        if len(x) > 2:
            if x not in stopset:
                val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", x)
                if val is not None:
                    x = replaceTwoOrMore(x)
                    x = processTweet(x)
                    x = x.strip('\'"?,.')
                    x = stemmer.stem(x).lower()
                    bla.append(x)
    return bla
xyz = []
for lijn in inpTweets:
    xyz.append((preprocessing(lijn[0]), lijn[1]))
random.shuffle(xyz)

featureList = []
k = 0
while k in range(0, len(xyz)):
    featureList.extend(xyz[k][0])
    k = k + 1

fd = nltk.FreqDist(featureList)
featureList = list(fd.keys())[2000:]

def document_features(doc):
    features = {}
    document_words = set(doc)
    for word in featureList:
        features['contains(%s)' % word] = (word in document_words)
    return features

featuresets = nltk.classify.util.apply_features(document_features, xyz)
training_set, test_set = featuresets[2000:], featuresets[:2000]
classifier = nltk.NaiveBayesClassifier.train(training_set)
Your code uses the 2000 most common words as the classification features. Just select the bigrams you want to use, and convert them to features in document_features(). A feature like "contains (the dog)" will work just like "contains (dog)".
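As a rough sketch of that idea (not the only way to do it), you could build a bigram feature list next to featureList and check a document's bigrams inside document_features(); the name bigramFeatureList and the cutoff of 2000 are purely illustrative:
bigram_fd = nltk.FreqDist(bg for (tokens, label) in xyz for bg in nltk.bigrams(tokens))
bigramFeatureList = [bg for bg, count in bigram_fd.most_common(2000)]   # illustrative cutoff

def document_features(doc):
    features = {}
    document_words = set(doc)
    document_bigrams = set(nltk.bigrams(doc))
    for word in featureList:
        features['contains(%s)' % word] = (word in document_words)
    for bg in bigramFeatureList:
        features['contains(%s %s)' % bg] = (bg in document_bigrams)
    return features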
An interesting approach is using a sequential backoff tagger, which allows you to chain taggers together: in this way you could train an n-gram tagger and a Naive Bayes classifier and chain them together.
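In NLTK, such a backoff chain looks roughly like this (shown on the tagged Brown corpus purely to illustrate the mechanism, not on the tweet data from the question; it assumes the brown corpus has been downloaded):
import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')[:3000]
t0 = nltk.DefaultTagger('NN')                     # last resort: tag everything as NN
t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # falls back to t0 for unseen words
t2 = nltk.BigramTagger(train_sents, backoff=t1)   # falls back to t1 for unseen contexts
print(t2.tag(['the', 'quick', 'brown', 'fox']))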
I have the following code. I know that I can use the apply_freq_filter function to filter out collocations below a frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case bigrams) in a document before I decide what frequency threshold to set for filtering. As you can see, I am using the NLTK collocations class.
import nltk
from nltk.collocations import *
line = ""
open_file = open('a_text_file','r')
for val in open_file:
    line += val
tokens = line.split()
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)
print finder.nbest(bigram_measures.pmi, 100)
NLTK comes with its own bigrams generator, as well as a convenient FreqDist() function.
f = open('a_text_file')
raw = f.read()
tokens = nltk.word_tokenize(raw)
#Create your bigrams
bgs = nltk.bigrams(tokens)
#compute frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
for k, v in fdist.items():
    print k, v
Once you have access to the bigrams and their frequency distribution, you can filter according to your needs.
Hope that helps.
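For example, a small sketch of inspecting the distribution before committing to a cutoff (the threshold of 3 is simply the value already used in the question):
# look at the most frequent bigrams before choosing a threshold
for bg, count in fdist.most_common(20):
    print(bg, count)

# keep only bigrams that occur at least 3 times
frequent_bigrams = [bg for bg, count in fdist.items() if count >= 3]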
The finder.ngram_fd.viewitems() method also works (in Python 3, use finder.ngram_fd.items()).
I tried all the above and found a simpler solution. NLTK's FreqDist has a most_common method that gives you the most frequent n-grams directly.
Here filtered_sentence is my list of word tokens:
import nltk
from nltk.util import ngrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
word_fd = nltk.FreqDist(filtered_sentence)
bigram_fd = nltk.FreqDist(nltk.bigrams(filtered_sentence))
bigram_fd.most_common()
This should give the output as:
[(('working', 'hours'), 31),
(('9', 'hours'), 14),
(('place', 'work'), 13),
(('reduce', 'working'), 11),
(('improve', 'experience'), 9)]
from nltk import FreqDist
from nltk.util import ngrams

def compute_freq():
    textfile = open('corpus.txt', 'r')
    bigramfdist = FreqDist()
    for line in textfile:
        if len(line) > 1:
            tokens = line.strip().split(' ')
            bigrams = ngrams(tokens, 2)
            bigramfdist.update(bigrams)
    textfile.close()
    return bigramfdist  # return the distribution so it can be inspected and filtered
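With the return added above, a possible usage (compute_freq and corpus.txt come from the snippet itself) would be:
bigram_counts = compute_freq()
for bigram, count in bigram_counts.most_common(10):
    print(bigram, count)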