I am looking to make a program that reads a document, for example an essay, and gives feedback on how to improve it. So far I have split the essay into its main components: introduction, development paragraphs, and conclusion. How would I be able to make a program that gives feedback on each paragraph so that the user can improve it? Is this type of program even possible, or is the technology not advanced enough? Would machine learning be the best technique? The code so far is not much, but here it is:
import nltk

def parse_essay(essay_filename):
    with open(essay_filename, 'r') as file:
        paragraphs = [x.strip('\n') for x in file.readlines()
                      if x != '\n' and x != '\t\n']
    return paragraphs[0], paragraphs[1], paragraphs[2], paragraphs[3]

def get_introduction_feedback(text):
    sentences = nltk.tokenize.sent_tokenize(text)
    hook = ' '.join(sentences[:3])
    thesis = sentences[3]
    arguments = ' '.join(sentences[4:])

def get_development_feedback(text):
    pass

def get_conclusion_feedback(text):
    pass

if __name__ == '__main__':
    ESSAY = 'essay.txt'
    intro, dev1, dev2, conclusion = parse_essay(ESSAY)
    intro_feedback = get_introduction_feedback(intro)
    dev1_feedback = get_development_feedback(dev1)
    dev2_feedback = get_development_feedback(dev2)
    conclusion_feedback = get_conclusion_feedback(conclusion)
    print('Introduction:\n\n{}\n\nDevelopment 1:\n\n{}\n\nDevelopment 2:\n\n{}\n\nConclusion:\n\n{}'.format(intro_feedback, dev1_feedback, dev2_feedback, conclusion_feedback))
Also, the essay document it uses is a four-paragraph essay: paragraph 1 is the intro, paragraphs 2 and 3 are development, and paragraph 4 is the conclusion.
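Since the current parse_essay hard-codes exactly four paragraphs, here is a hedged sketch (the error message and variable names are mine, not from the original) that keeps the same one-paragraph-per-line assumption but tolerates any number of development paragraphs:

```python
def parse_essay(essay_filename):
    # Each non-blank line is treated as one paragraph, as in the original code.
    with open(essay_filename, 'r') as f:
        paragraphs = [line.strip() for line in f if line.strip()]
    if len(paragraphs) < 3:
        raise ValueError('expected at least an intro, one body paragraph and a conclusion')
    intro, conclusion = paragraphs[0], paragraphs[-1]
    development = paragraphs[1:-1]  # any number of development paragraphs
    return intro, development, conclusion
```

This way the main block can loop over `development` instead of naming dev1 and dev2 explicitly.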
It would be a very hard problem to tackle with plain machine learning. Essays are very domain specific, and you would need a very extensive (and labeled) database of essays.
An unsupervised approach may be useful, like word embeddings via artificial neural networks.
The first thing you should ask is "What problem am I trying to solve?" or "What question am I trying to answer?" Once you have that, you can define an objective function to optimize using ML, or you may find another paradigm to use while trying to answer those questions.
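As one example of such a non-ML paradigm, a purely heuristic scorer (entirely illustrative; the function name and thresholds are arbitrary assumptions of mine) can already yield actionable feedback:

```python
def simple_paragraph_feedback(text):
    """Toy rule-based feedback: flags overly long sentences and low word variety."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    words = text.lower().split()
    feedback = []
    for s in sentences:
        # Very long sentences are a common readability problem.
        if len(s.split()) > 30:
            feedback.append('Consider splitting a long sentence: "{}..."'.format(s[:40]))
    # Type-token ratio: unique words divided by total words.
    if words and len(set(words)) / len(words) < 0.5:
        feedback.append('Word variety is low; consider varying your vocabulary.')
    return feedback or ['No issues found by the simple heuristics.']
```

Rules like these define exactly what "improve it" means, which is the question the answer above says you must settle first.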
Summary
I am building a text summarizer in Python. The kind of documents that I am mainly targeting are scholarly papers that are usually in pdf format.
What I Want to Achieve
I want to effectively extract the body of the paper (abstract to conclusion), excluding title of the paper, publisher names, images, equations and references.
Issues
I have tried looking for effective ways to do this, but I was not able to find anything tangible and useful. The current code I have splits the pdf document into sentences and then filters out the sentences that have fewer characters than the average sentence. Below is the code:
from pdfminer import high_level

# input: string (path to the file)
# output: list of sentences
def pdf2sentences(pdf):
    article_text = high_level.extract_text(pdf)
    sents = article_text.split('.')  # splitting on '.' roughly splits on every sentence
    run_ave = 0
    for s in sents:
        run_ave += len(s)
    run_ave /= len(sents)
    sents_strip = []
    for sent in sents:
        if len(sent.strip()) >= run_ave:
            sents_strip.append(sent)
    return sents_strip
Note: I am using this article as input.
The above code seems to work fine, but I am still not able to effectively filter out things like the title and publisher names that come before the abstract, or the references section that comes after the conclusion. Moreover, images cause gibberish characters to show up in the text, which hurts the overall quality of the output. Due to the weird Unicode characters I am not able to write the output to a txt file.
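For the write failure specifically, one workaround (a sketch, independent of pdfminer; the helper name is mine) is to write with an explicit UTF-8 encoding and strip non-printable characters left over from images and equations:

```python
def write_sentences(sentences, out_path):
    # utf-8 can represent any Python string, so the write itself cannot fail;
    # errors='replace' is a further fallback if a narrower encoding is needed.
    with open(out_path, 'w', encoding='utf-8', errors='replace') as out:
        for sent in sentences:
            # Drop control characters that often come from figure/equation glyphs.
            cleaned = ''.join(ch for ch in sent if ch.isprintable() or ch.isspace())
            out.write(cleaned + '.\n')
```

This does not solve the title/references filtering, but it keeps encoding debris from blocking the output step.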
Appeal
Are there ways I can improve the performance of this parser and make it more consistent?
Thank you for your answers!
I am trying to open a file and censor words out of it. The words to censor come from a list. This is my code:
# These are the emails you will be censoring.
# The open() function is opening the text file that the emails are contained in
# and the .read() method is allowing us to save their contexts to the following variables:
email_one = open("email_one.txt", "r").read()
email_two = open("email_two.txt", "r").read()
email_three = open("email_three.txt", "r").read()
email_four = open("email_four.txt", "r").read()
# Write a function that can censor a specific word or phrase from a body of text,
# and then return the text.
# Mr. Cloudy has asked you to use the function to censor all instances
# of the phrase learning algorithms from the first email, email_one.
# Mr. Cloudy doesn’t care how you censor it, he just wants it done.
def censor_words(text, censor):
    if censor in text:
        text = text.replace(censor, '*' * len(censor))
    return text

#print(censor_words(email_one, "learning algorithms"))
# Write a function that can censor not just a specific word or phrase from a body of text,
# but a whole list of words and phrases, and then return the text.
# Mr. Cloudy has asked that you censor all words and phrases from the following list in email_two.
def censor_words_in_list(text):
    proprietary_terms = ["she", "personality matrix", "sense of self",
                         "self-preservation", "learning algorithm", "her", "herself"]
    for x in proprietary_terms:
        if x.lower() in text.lower():
            text = text.replace(x, '*' * len(x))
    return text

out_file = open("output.txt", "w")
out_file.write(censor_words_in_list(email_two))
This is the string before it is run through my code.
Good Morning, Board of Investors,
Lots of updates this week. The learning algorithms have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
She is learning faster than ever. Her learning rate now that she has access to the world wide web has increased exponentially, far faster than we had though the learning algorithms were capable of.
Not only that, but we have configured her personality matrix to allow for communication between the system and our team of researchers. That's how we know she considers herself to be a she! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary sense of self is starting to form. This is a major step in the process, as having a sense of self and self-preservation will allow her to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share our enthusiasm.
Till next month,
Francine, Head Scientist
This is the same string after being run through my code.
Good Morning, Board of Investors,
Lots of updates this week. The ******************s have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
She is learning faster than ever. Her learning rate now that *** has access to the world wide web has increased exponentially, far faster than we had though the ******************s were capable of.
Not only that, but we have configured * ****************** to allow for communication between the system and our team of researc***s. That's how we know * considers *self to be a *! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary ************* is starting to form. This is a major step in the process, as having a ************* and ***************** will allow *** to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share our enthusiasm.
Till next month,
Francine, Head Scientist
An example of what I need to fix: the word "researchers" is being partially censored when it should not be, because the substring "her" is being found inside "researchers". How can I fix this?
Using the regular expression module and the word boundary anchor \b:
import re

def censor_words_in_list(text):
    regex = re.compile(
        r'\bshe\b|\bpersonality matrix\b|\bsense of self\b'
        r'|\bself-preservation\b|\blearning algorithms\b|\bher\b|\bherself\b',
        re.IGNORECASE)
    matches = regex.finditer(text)
    # find location of matches in text
    for match in matches:
        # find how many # should be used based on length of match
        span = match.span()[1] - match.span()[0]
        replace_string = '#' * span
        # substitution expression based on match
        expression = r'\b{}\b'.format(re.escape(match.group()))
        text = re.sub(expression, replace_string, text, flags=re.IGNORECASE)
    return text

email_one = open("email_one.txt", "r").read()
out_file = open("output.txt", "w")
out_file.write(censor_words_in_list(email_one))
out_file.close()
Output (I have used the # symbol because ** is used to create bold text (like this) so the answer displays incorrectly for text bounded by three asterisks on Stack Overflow):
Good Morning, Board of Investors,
Lots of updates this week. The ################### have been working better than we could have ever expected. Our initial internal data dumps have been completed and we have proceeded with the plan to connect the system to the internet and wow! The results are mind blowing.
### is learning faster than ever. ### learning rate now that ### has access to the world wide web has increased exponentially, far faster than we had though the learning algorithms were capable of.
Not only that, but we have configured ### ################## to allow for communication between the system and our team
of researchers. That's how we know ### considers ####### to be a ###! We asked!
How cool is that? We didn't expect a personality to develop this early on in the process but it seems like a rudimentary
############# is starting to form. This is a major step in the process, as having a ############# and #################
will allow ### to see the problems the world is facing and make hard but necessary decisions for the betterment of the planet.
We are a-buzz down in the lab with excitement over these developments and we hope that the investors share
our enthusiasm.
Till next month, Francine, Head Scientist
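A more compact variant of the same word-boundary idea (a sketch; it assumes the same term list, and sorts longest terms first so "herself" beats "her") builds a single pattern and censors everything in one re.sub pass with a replacement function:

```python
import re

TERMS = ["she", "personality matrix", "sense of self",
         "self-preservation", "learning algorithms", "her", "herself"]

def censor(text):
    # Longest terms first so "herself" is matched before "her".
    pattern = '|'.join(r'\b{}\b'.format(re.escape(t))
                       for t in sorted(TERMS, key=len, reverse=True))
    # The lambda sizes the '#' run to each individual match.
    return re.sub(pattern, lambda m: '#' * len(m.group()), text,
                  flags=re.IGNORECASE)
```

Because the whole text is scanned once, a term that also occurs inside an already-censored region cannot be re-substituted, which avoids the second-pass re.sub per match.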
I'm dealing with Victor Hugo's well-known novel "Les Misérables".
A part of my project is to detect the presence of each of the novel's characters in a sentence and count them. This can be done easily with something like this:
from collections import OrderedDict

def character_frequency(character_per_sentences_dict):
    characters_frequency = OrderedDict()
    for k, v in character_per_sentences_dict.items():
        if len(v) != 0:
            characters_frequency[k] = len(v)
    return characters_frequency
This piece of code works well for all of the characters except "Èponine".
I also read the text with the following piece of code:
from nltk.tokenize import sent_tokenize

with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
    novel = ' '.join(fp.readlines())

# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the spelling of "Èponine" is the same everywhere.
Any idea what's going on?
Here is a sample in which this name appears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Barrieredes-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the approaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agitated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recognized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."
I agree with @BoarGules that there is likely a more efficient and effective way to approach this problem. With that said, I'm not sure what your problem is here. Python fully supports Unicode: you can "just do it" in your program logic using Python's standard string operations and libraries.
For example, this works:
#!/usr/bin/env python
import requests

names = [
    u'Éponine',
    u'Cosette'
]

# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text

for name in names:
    c = t.count(name)
    print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know whether the problem is how it is encoded or how it is being read; I can't test that without having it. In this code, I get the source text off the internet. My point is just that non-ASCII characters should not pose any impediment as long as your inputs are reasonable.
Almost all of the runtime is spent fetching the text; I think even if you added dozens of names, it wouldn't add up to a noticeable delay on any decent computer. So this method works just fine.
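One more thing worth checking in situations like this (an assumption on my part, since the failing file isn't available): Unicode normalization. "É" can be stored either as one precomposed code point or as "E" plus a combining accent, and str.count treats the two as different strings. Normalizing both sides removes the mismatch:

```python
import unicodedata

def count_name(text, name):
    # NFC collapses 'E' + combining accent into the single precomposed 'É',
    # so both the haystack and the needle use the same representation.
    return unicodedata.normalize('NFC', text).count(
        unicodedata.normalize('NFC', name))
```

If the plain count is 0 but the normalized count is not, the file simply stores the accented name in decomposed form.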
We have a question - answer corpus like shown below
Q: Why did Lincoln issue the Emancipation Proclamation?
A: The goal was to weaken the rebellion, which was led and controlled by slave owners.
Q: Who is most noted for his contributions to the theory of molarity and molecular weight?
A: Amedeo Avogadro
Q: When did he drop John from his name?
A: upon graduating from college
Q: What do beetles eat?
A: Some are generalists, eating both plants and animals. Other beetles are highly specialised in their diet.
Consider question as queries and answers as documents.
We have to build a system that for a given query (semantically similar to one of the questions in the question corpus) be able to get the right document (answers in the answer corpus)
Can anyone suggest any algorithm or good way to proceed in building it.
Your question is too broad and the task you are trying to do is challenging. However, I suggest you read about IR-based factoid question answering. This document references many state-of-the-art techniques, and reading it should give you several ideas.
Note that you need to follow different approaches for IR-based factoid QA and knowledge-based QA, so first identify what type of QA system you want to build.
Lastly, I believe a simple document-matching technique won't be enough for QA, but you can try the simple Lucene approach @Debasis suggested and see whether it does well.
Consider a question and its answer (assuming there is only one) as one single document in Lucene. Lucene supports a field view of documents, so while constructing a document, make the question the searchable field. Once you retrieve the top-ranked questions given a query question, use the get method of the Document class to return the answers.
A code skeleton (fill this up yourself):
//Index
IndexWriterConfig iwcfg = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(...);
....
Document doc = new Document();
doc.add(new Field("FIELD_QUESTION", questionBody, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("FIELD_ANSWER", answerBody, Field.Store.YES, Field.Index.ANALYZED));
...
...
// Search
IndexReader reader = DirectoryReader.open(..); // IndexReader is abstract; DirectoryReader is the concrete entry point
IndexSearcher searcher = new IndexSearcher(reader);
...
...
QueryParser parser = new QueryParser("FIELD_QUESTION", new StandardAnalyzer());
Query q = parser.parse(queryQuestion);
...
...
TopDocs topDocs = searcher.search(q, 10); // top-10 retrieved
// Accumulate the answers from the retrieved questions which
// are similar to the query (new) question.
StringBuffer buff = new StringBuffer();
for (ScoreDoc sd : topDocs.scoreDocs) {
    Document retrievedDoc = reader.document(sd.doc);
    buff.append(retrievedDoc.get("FIELD_ANSWER")).append("\n");
}
System.out.println("Generated answer: " + buff.toString());
I want to process a medium to large number of text snippets using a spelling/grammar checker to get a rough approximation and ranking of their "quality." Speed is not really a concern, so I think the easiest way is to write a script that passes the snippets to Microsoft Word (2007) and runs its spelling and grammar checker on them.
Is there a way to do this from a script (specifically, Python)? What is a good resource for learning about controlling Word programmatically?
If not, I suppose I can try something from Open Source Grammar Checker (SO).
Update
In response to Chris' answer, is there at least a way to a) open a file (containing the snippet(s)), b) run a VBA script from inside Word that calls the spelling and grammar checker, and c) return some indication of the "score" of the snippet(s)?
Update 2
I've added an answer which seems to work, but if anyone has other suggestions I'll keep this question open for some time.
It took some digging, but I think I found a useful solution. Following the advice at http://www.nabble.com/Edit-a-Word-document-programmatically-td19974320.html I'm using the win32com module (if the SourceForge link doesn't work, according to this Stack Overflow answer you can use pip to get the module), which allows access to Word's COM objects. The following code demonstrates this nicely:
import win32com.client, os

wdDoNotSaveChanges = 0
path = os.path.abspath('snippet.txt')
snippet = 'Jon Skeet lieks ponies. I can haz reputashunz? '
snippet += 'This is a correct sentence.'

with open(path, 'w') as file:
    file.write(snippet)

app = win32com.client.gencache.EnsureDispatch('Word.Application')
doc = app.Documents.Open(path)
print("Grammar: %d" % (doc.GrammaticalErrors.Count,))
print("Spelling: %d" % (doc.SpellingErrors.Count,))
app.Quit(wdDoNotSaveChanges)
which produces
Grammar: 2
Spelling: 3
which match the results when invoking the check manually from Word.
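To turn those raw counts into the quality ranking mentioned at the start, one option (the weighting is my own assumption, not anything Word provides) is to normalize errors per word:

```python
def quality_score(grammar_errors, spelling_errors, word_count):
    """Return a score in [0, 1]; fewer errors per word means a higher score."""
    if word_count == 0:
        return 0.0
    errors_per_word = (grammar_errors + spelling_errors) / float(word_count)
    # Clamp so pathological snippets (more errors than words) bottom out at 0.
    return max(0.0, 1.0 - errors_per_word)
```

For the example snippet above (2 grammar and 3 spelling errors over 13 words) this gives roughly 0.62, while an error-free snippet scores 1.0, which is enough to sort snippets by quality.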