I was given this formula, called FRES (the Flesch reading-ease test), which is used to measure the readability of a document:
FRES = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words)
My task is to write a Python function that returns the FRES of a text, so I need to convert this formula into a Python function.
I have re-implemented my code based on an answer I received, to show what I have so far and the result it gives me:
import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
import re
from itertools import chain
from nltk.corpus import gutenberg
VC = re.compile('[aeiou]+[^aeiou]+', re.I)
def count_syllables(word):
    return len(VC.findall(word))

def compute_fres(text):
    """Return the FRES of a text.
    >>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    >>> compute_fres(emma) # doctest: +ELLIPSIS
    99.40...
    """
    for filename in gutenberg.fileids():
        sents = gutenberg.sents(filename)
        words = gutenberg.words(filename)
        num_sents = len(sents)
        num_words = len(words)
        num_syllables = sum(count_syllables(w) for w in words)
        score = 206.835 - 1.015 * (num_words / num_sents) - 84.6 * (num_syllables / num_words)
    return(score)
After running the code this is the result message I got:
Failure
Expected :99.40...
Actual :92.84866041488623
File "C:/Users/PycharmProjects/a1/a1.py", line 60, in a1.compute_fres
Failed example:
compute_fres(emma) # doctest: +ELLIPSIS
Expected:
99.40...
Got:
92.84866041488623
My function is supposed to pass the doctest and return 99.40... I'm also not allowed to edit the count_syllables function, since it came with the task:
import re
VC = re.compile('[aeiou]+[^aeiou]+', re.I)
def count_syllables(word):
    return len(VC.findall(word))
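For reference, this regex just counts runs of vowels followed by consonants, so it only approximates syllable counts; for example:
>>> count_syllables("gutenberg")   # matches 'ut', 'enb', 'erg'
3
>>> count_syllables("emma")        # matches only 'emm'; the final 'a' has no consonant after it
1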
This question has been very tricky, but at least now it's giving me a result instead of an error message; I'm just not sure why it's giving a different result.
Any help will be very appreciated. Thank you.
BTW, there's the textstat library.
from textstat.textstat import textstat
from nltk.corpus import gutenberg
for filename in gutenberg.fileids():
    print(filename, textstat.flesch_reading_ease(gutenberg.raw(filename)))
If you're bent on coding up your own, you first have to:
1. decide whether punctuation counts as a word
2. define how to count the number of syllables in a word
If punctuation counts as a word and syllables are counted by the regex in your question, then:
import re
from itertools import chain
from nltk.corpus import gutenberg
def num_syllables_per_word(word):
    return len(re.findall('[aeiou]+[^aeiou]+', word))

for filename in gutenberg.fileids():
    sents = gutenberg.sents(filename)
    words = gutenberg.words(filename)  # i.e. list(chain(*sents))
    num_sents = len(sents)
    num_words = len(words)
    num_syllables = sum(num_syllables_per_word(w) for w in words)
    score = 206.835 - 1.015 * (num_words / num_sents) - 84.6 * (num_syllables / num_words)
    print(filename, score)
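If you want compute_fres itself to use the text argument it is given (rather than looping over every Gutenberg file), a minimal sketch along the same lines, using the same regex-based syllable count and NLTK's default tokenizers, might look like the code below. Note that nltk.sent_tokenize/nltk.word_tokenize can count sentences and words slightly differently from the pre-tokenized gutenberg.sents/gutenberg.words, so the exact doctest value isn't guaranteed:
import re
import nltk

VC = re.compile('[aeiou]+[^aeiou]+', re.I)

def count_syllables(word):
    return len(VC.findall(word))

def compute_fres(text):
    # Tokenize the raw text itself instead of iterating over the corpus files.
    sents = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)
    num_sents = len(sents)
    num_words = len(words)
    num_syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (num_words / num_sents) - 84.6 * (num_syllables / num_words)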
Related
The first step is tokenizing the text from a dataframe using NLTK. Then I create a spelling correction using TextBlob. For this, I convert the output from a tuple to a string. After that, I need to lemmatize/stem (using NLTK). The problem is that my output comes back in a stripped format, so it cannot be lemmatized/stemmed.
# create a dataframe
import pandas as pd
import nltk
import functools, operator
from textblob import TextBlob

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening", "studying"]})

# tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]

df["text2"] = df["text"].apply(tokenize)

# spelling correction
def spell_eng(text):
    text = TextBlob(str(text)).correct()
    # convert from tuple to str
    text = functools.reduce(operator.add, (text))
    return text

df['text3'] = df['text2'].apply(spell_eng)

# lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]

df['text4'] = df['text3'].apply(stem_eng)
Generated output:
Desired output:
text4
--------------
[spell]
[be]
[work,cook,listen]
[study]
I see where the problem is: the dataframe is storing these lists as strings, so the lemmatization is not working. Also note that this comes from the spell_eng part.
I have written a solution, which is a slight modification of your code.
import pandas as pd
import nltk
from textblob import TextBlob
import functools
import operator

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening", "studying"]})

# tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]

df["text2"] = df["text"].apply(tokenize)

# spelling correction
def spell_eng(text):
    text = [TextBlob(str(w)).correct() for w in text]  # CHANGE
    # convert from tuple to str
    text = [functools.reduce(operator.add, (w)) for w in text]  # CHANGE
    return text

df['text3'] = df['text2'].apply(spell_eng)

# lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]

df['text4'] = df['text3'].apply(stem_eng)
df['text4']
Hope these things help.
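As an aside, an untested sketch of a shorter spell_eng (assuming the same imports and dataframe as above) keeps each corrected token as a plain string, which avoids the tuple-to-string reduce step entirely:
def spell_eng(words):
    # TextBlob(...).correct() returns a TextBlob; str() turns it back into a plain token
    return [str(TextBlob(w).correct()) for w in words]

df['text3'] = df['text2'].apply(spell_eng)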
Consider the following for spell-correction:
from autocorrect import spell
import re

WORD = re.compile(r'\w+')

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.", "This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))
    return sptext

print(spell_correct(text))
Here is the output for above piece of code:
How can I stop displaying the output in a Jupyter notebook? In particular, if we have a large number of text documents there will be a lot of output.
My second question is: how can I improve the speed and accuracy of the code (check the word "veri" in the output, for example) when applying it to a large dataset? Is there a better way to do this? I'd appreciate your responses and (alternative) solutions with better speed.
As @khelwood said in the comments, you should use autocorrect.Speller:
from autocorrect import Speller
import re

spell = Speller(lang="en")
WORD = re.compile(r'\w+')

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.", "This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))
    return sptext

print(spell_correct(text))
#Output
#['hi welcome to spelling', 'this is just an example but consider a veri big corpus']
As an alternative, you could use a list comprehension to possibly increase the speed, and you could also use the pyspellchecker library, which improves the accuracy for the word 'veri' in this case:
from spellchecker import SpellChecker
import re

WORD = re.compile(r'\w+')
spell = SpellChecker()

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.", "This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext = [' '.join([spell.correction(w).lower() for w in reTokenize(doc)]) for doc in text]
    return sptext

print(spell_correct(text))
Output:
['hi welcome to spelling', 'this is just an example but consider a very big corpus']
I would like to calculate the cosine similarity of the consecutive pairs of articles in a JSON file. So far I have managed to do it, but I just realized that when computing the tf-idf of each article I am not using the terms from all the articles in the file, only those from each pair. Here is the code I am using, which gives the cosine-similarity coefficient of each consecutive pair of articles.
import json
import nltk

with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]

## Loading the packages needed:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

## Defining our functions to filter the data
# Short for stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# Short for removing punctuation etc.
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

## First function that creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

## Function that, incorporating the first function, converts all words to lower case and removes the punctuation (previously specified)
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

## Lastly, a vectorizer is created that contains all the previous steps plus stopword removal
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

## Calculation, one by one, of the cosine similarity
def foo(x, y):
    tfidf = vectorizer.fit_transform([x, y])
    return ((tfidf * tfidf.T).A)[0, 1]

my_funcs = {}
for i in range(len(data) - 1):
    x = data[i]['body']
    y = data[i+1]['body']
    foo.func_name = "cosine_sim%d" % i
    my_funcs["cosine_sim%d" % i] = foo
    print(foo(x, y))
Any idea how to compute the cosine similarity using the terms of all the articles in the JSON file rather than only those of each pair?
Kind regards,
Andres
I think, based on our discussion above, you need to change the foo function and everything below. See the code below. Note that I haven't actually run this, since I don't have your data and no sample lines are provided.
## Loading the packages needed:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
import json
from sklearn.metrics.pairwise import cosine_similarity

with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]

## Defining our functions to filter the data
# Short for stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# Short for removing punctuation etc.
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

## First function that creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

## Function that, incorporating the first function, converts all words to lower case and removes the punctuation (previously specified)
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

## tfidf fitted on all article bodies at once
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf_data = vectorizer.fit_transform([d['body'] for d in data])

# cosine similarities
similarity_matrix = cosine_similarity(tfidf_data)
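To recover the consecutive-pair scores that the original foo loop printed, you can then read them one position above the diagonal of that matrix, for example:
# similarity between article i and article i+1, analogous to foo(x, y) in the question
for i in range(similarity_matrix.shape[0] - 1):
    print(i, similarity_matrix[i, i + 1])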
I got this function and tried to edit it a little for my purposes,
but instead of getting bigrams I get unigrams. What do I need to add or edit?
I am really new to Python and NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
import re

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)
    result = [x for x in tokens if x not in stopwords.words('english') and len(x) > 3]
    return result

filename = raw_input('Enter File Name :')
word_list = re.split('\s+', file(filename).read().lower())
f = open('test2.csv', 'w')
for line in word_list:
    features = get_bigrams(line)
    print features
    f.write(str(line))
    f.write("\n")
The output for an example, "It has been a long time", is:
It
has
been
a
long
time
Yet I am looking for something like
It has
has been
been a
a long
long time
I think your problem is how you tackle the file reading and the line processing:
The following line gives you a list of words (as the name suggests)
word_list = re.split('\s+', file(filename).read().lower())
but later you treat each single word as a line:
for line in word_list:
This means that your code simply cannot work as intended.
If I understand you correctly you might want to change file reading in the following way:
filename = raw_input('Enter File Name :')
lines = file(filename).readlines()
f = open('test2.csv', 'w')
for line in lines:
    features = get_bigrams(line)
    # do more things
NLTK seems like overkill here. Why not just do:
def pairs(seq):
    return zip(seq, seq[1:])

s = "It has been a long time"
words = s.split()
for bigram in pairs(words):
    print bigram
Result:
('It', 'has')
('has', 'been')
('been', 'a')
('a', 'long')
('long', 'time')
Your function get_bigrams seems to work for me, so I think the problem is your file or the way you read it.
By the way, I'd like to suggest a shorter version of get_bigrams:
import nltk

def get_bigrams(sentence):
    tokens = nltk.word_tokenize(sentence)
    return zip(tokens, tokens[1:])
Use:
>>> [' '.join(b) for b in get_bigrams("It has been a long time")]
['It has', 'has been', 'been a', 'a long', 'long time']
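One small caveat: in Python 3, zip returns a lazy iterator, so if you need to index the bigrams or iterate them more than once, wrap the result in list():
import nltk

def get_bigrams(sentence):
    tokens = nltk.word_tokenize(sentence)
    # list() so the bigrams can be indexed and reused, not just consumed once
    return list(zip(tokens, tokens[1:]))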
I have the following code. I know that I can use the apply_freq_filter function to filter out collocations below a frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case bigrams) in a document before I decide what frequency threshold to set. As you can see, I am using the NLTK collocations class.
import nltk
from nltk.collocations import *

line = ""
open_file = open('a_text_file', 'r')
for val in open_file:
    line += val

tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)
print finder.nbest(bigram_measures.pmi, 100)
NLTK comes with its own bigrams generator, as well as a convenient FreqDist() function.
import nltk

f = open('a_text_file')
raw = f.read()

tokens = nltk.word_tokenize(raw)

# Create your bigrams
bgs = nltk.bigrams(tokens)

# compute frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
for k, v in fdist.items():
    print k, v
Once you have access to the BiGrams and the frequency distributions, you can filter according to your needs.
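For example, a minimal sketch of that filtering step, keeping only bigrams that occur at least three times (mirroring apply_freq_filter(3) in the question) and reusing fdist from above:
# keep only the bigrams whose raw count is at least 3
frequent_bigrams = [(bg, count) for bg, count in fdist.items() if count >= 3]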
Hope that helps.
The finder.ngram_fd.viewitems() function works
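In Python 3, where dict-style viewitems() no longer exists, the equivalent is finder.ngram_fd.items(); for example:
# finder.ngram_fd is the finder's internal FreqDist of bigram counts
for bigram, count in finder.ngram_fd.items():
    print(bigram, count)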
I tried all of the above and found a simpler solution: NLTK's FreqDist gives you the most common n-gram frequencies directly.
Here, filtered_sentence is my list of word tokens:
import nltk
from nltk.util import ngrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
word_fd = nltk.FreqDist(filtered_sentence)
bigram_fd = nltk.FreqDist(nltk.bigrams(filtered_sentence))
bigram_fd.most_common()
This should give the output as:
[(('working', 'hours'), 31),
(('9', 'hours'), 14),
(('place', 'work'), 13),
(('reduce', 'working'), 11),
(('improve', 'experience'), 9)]
from nltk import FreqDist
from nltk.util import ngrams

def compute_freq():
    textfile = open('corpus.txt', 'r')

    bigramfdist = FreqDist()
    threeramfdist = FreqDist()

    for line in textfile:
        if len(line) > 1:
            tokens = line.strip().split(' ')

            bigrams = ngrams(tokens, 2)
            bigramfdist.update(bigrams)
    return bigramfdist

bigramfdist = compute_freq()