I want to process a medium to large number of text snippets using a spelling/grammar checker to get a rough approximation and ranking of their "quality." Speed is not really of concern either, so I think the easiest way is to write a script that passes off the snippets to Microsoft Word (2007) and runs its spelling and grammar checker on them.
Is there a way to do this from a script (specifically, Python)? What is a good resource for learning about controlling Word programmatically?
If not, I suppose I can try something from Open Source Grammar Checker (SO).
Update
In response to Chris' answer, is there at least a way to a) open a file (containing the snippet(s)), b) run a VBA script from inside Word that calls the spelling and grammar checker, and c) return some indication of the "score" of the snippet(s)?
Update 2
I've added an answer which seems to work, but if anyone has other suggestions I'll keep this question open for some time.
It took some digging, but I think I found a useful solution. Following the advice at http://www.nabble.com/Edit-a-Word-document-programmatically-td19974320.html I'm using the win32com module (if the SourceForge link doesn't work, according to this Stack Overflow answer you can use pip to get the module), which allows access to Word's COM objects. The following code demonstrates this nicely:
import win32com.client, os
wdDoNotSaveChanges = 0
path = os.path.abspath('snippet.txt')
snippet = 'Jon Skeet lieks ponies. I can haz reputashunz? '
snippet += 'This is a correct sentence.'
file = open(path, 'w')
file.write(snippet)
file.close()
app = win32com.client.gencache.EnsureDispatch('Word.Application')
doc = app.Documents.Open(path)
print "Grammar: %d" % (doc.GrammaticalErrors.Count,)
print "Spelling: %d" % (doc.SpellingErrors.Count,)
app.Quit(wdDoNotSaveChanges)
which produces
Grammar: 2
Spelling: 3
which match the results when invoking the check manually from Word.
Related
I'm new to the Fitz library and am working on a project where I need to find a string in a PDF page. I'm running into a case where the text on the page that I'm searching on is hyphenated. I am aware of the TEXT_DEHYPHENATE flag that I can use in the search for function, but that doesn't work for me (as shown in the image here https://postimg.cc/zHZPdd6v ). I'm getting no cases when I search for the hyphenated string.
Python Script
LOC = "./test.pdf"
doc = fitz.open(LOC)
page = doc[1]
print(page.get_text())
found = page.search_for("lowcost", flags=TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low-cost", flags=TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low cost", flags=TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
for rect in found:
print(rect)
Output
Abstract
The objective of “XXXXXXXXXXXXXXXXXX” was design and assemble a low-
cost and efficient tool.
DONE
0
DONE
0
DONE
0
Can someone please point me to how I might be able to detect the hyphen in my file? Thank you!
Your first approach should work, look here:
# insert some hyphenated text
page.insert_textbox((100,100,300,300),"The objective of 'xxx' was design and assemble a low-\ncost and efficient tool.")
157.94699853658676
# now search for it again
page.search_for("lowcost") # 2 rectangles!
[Rect(159.3009796142578, 116.24800109863281, 175.8009796142578, 131.36199951171875),
Rect(100.0, 132.49501037597656, 120.17399597167969, 147.6090087890625)]
# each containing a text portion with hyphen removed
for rect in page.search_for("lowcost"):
print(page.get_textbox(rect))
low
cost
Without the original file there is no way to tell the reason for your failure.
Are you sure there really is text - and not e.g. an image or other hickups?
Edited: As per the comment of user #KJ below: PyMuPDF's C base library MuPDF regards all of the unicodes '-', 0xAD, 0x2010, 0x2011 as hyphens in this context. They all should work the same. Just reconfirmed it in an example.
So I have been struggling with this issue for what seems like forever now (I'm pretty new to Python). I am using Python 3.7 (need it to be 3.7 due to variations in the versions of packages I am using for the project) to develop an AI chatbot system that can converse with you based on your text input. The program reads the contents of a series of .yml files when it starts. In one of the .yml files I am developing a syntax for when the first 5 characters match a ^###^ pattern, it will instead execute the code and return the result of that execution rather than just output text back to the user. For example:
Normal Conversation:
- - What is AI?
- Artificial Intelligence is the branch of engineering and science devoted to constructing machines that think.
Service/Code-based conversation:
- - Say hello to me
- ^###^print("HELLO")
The idea is that when you ask it to say hello to you, the ^##^print("HELLO") string will be retrieved from the .yml file, the first 5 characters of the response will be removed, the response will be sent to a separate function in the python code where it will run the code and store the result into a variable which will be returned from the function into a variable that will give the nice, clean result of HELLO to the user. I realize that this may be a bit hard to follow, but I will straighten up my code and condense everything once I have this whole error resolved. As a side note: Oracle is just what I am calling the project. I'm not trying to weave Java into this whole mess.
THE PROBLEM is that it does not store the result of the code being run/executed/evaluated into the variable like it should.
My code:
def executecode(input):
print("The code to be executed is: ",input)
#note: the input may occasionally have single quotes and/or double quotes in the input string
result = eval("{}".format(input))
print ("The result of the code eval: ", result)
test = eval("2+2")
test
print(test)
return result
#app.route("/get")
def get_bot_response():
userText = request.args.get('msg')
print("Oracle INTERPRETED input: ", userText)
ChatbotResponse = str(english_bot.get_response(userText))
print("CHATBOT RESPONSE VARIABLE: ", ChatbotResponse)
#The interpreted string was a request due to the ^###^ pattern in front of the response in the custom .yml file
if ChatbotResponse[:5] == '^###^':
print("---SERVICE REQUEST---")
print(executecode(ChatbotResponse[5:]))
interpreter_response = executecode(ChatbotResponse[5:])
print("Oracle RESPONDED with: ", interpreter_response)
else:
print("Oracle RESPONDED with: ", ChatbotResponse)
return ChatbotResponse
When I run this code, this is the output:
Oracle INTERPRETED input: How much RAM do you have?
CHATBOT RESPONSE VARIABLE: ^###^print("HELLO")
---SERVICE REQUEST---
The code to be executed is: print("HELLO")
HELLO
The result of the code eval: None
4
None
The code to be executed is: print("HELLO")
HELLO
The result of the code eval: None
4
Oracle RESPONDED with: None
Output on the website interface
Essentially, need it to say HELLO for the "The result of the code eval:" output. This should get it to where the chatbot responds with HELLO in the web interface, which is the end goal here. It seems as if it IS executing the code due to the HELLO's after the "The code to be executed is:" output text. It's just not storing it into a variable like I need it to.
I have tried eval, exec, ast.literal_eval(), converting the input to string with str(), changing up the single and double quotes, putting \ before pairs of quotes, and a few other things. Whenever I get it to where the program interprets "print("HELLO")" when it executes the code, it complains about the syntax. Also, from several days of looking online I have figured out that exec and eval aren't generally favored due to a bunch of issues, however I genuinely do not care about that at the moment because I am trying to make something that works before I make something that is good and works. I have a feeling the problem is something small and stupid like it always is, but I have no idea what it could be. :(
I used these 2 resources as the foundation for the whole chatbot project:
Text Guide
Youtube Guide
Also, I am sorry for the rather lengthy and descriptive question. It's rare that I have to ask a question of my own on stackoverflow because if I have a question, it usually already has a good answer. It feels like I've tried everything at this point. If you have a better suggestion of how to do this whole system or you think I should try approaching this another way, I'm open to ideas.
Thank you for any/all help. It is very much appreciated! :)
The issue is that python's print() doesn't have a return value, meaning it will always return None. eval simply evaluates some expression, and returns back the return value from that expression. Since print() returns None, an eval of some print statement will also return None.
>>> from_print = print('Hello')
Hello
>>> from_eval = eval("print('Hello')")
Hello
>>> from_print is from_eval is None
True
What you need is a io stream manager! Here is a possible solution that captures any io output and returns that if the expression evaluates to None.
from contextlib import redirect_stout, redirect_stderr
from io import StringIO
# NOTE: I use the arg name `code` since `input` is a python builtin
def executecodehelper(code):
# Capture all potential output from the code
stdout_io = StringIO()
stderr_io = StringIO()
with redirect_stdout(stdout_io), redirect_stderr(stderr_io):
# If `code` is already a string, this should work just fine without the need for formatting.
result = eval(code)
return result, stdout_io.getvalue(), stderr_io.getvalue()
def executecode(code):
result, std_out, std_err = executecodehelper(code)
if result is None:
# This code didn't return anything. Maybe it printed something?
if std_out:
return std_out.rstrip() # Deal with trailing whitespace
elif std_err:
return std_err.rstrip()
else:
# Nothing was printed AND the return value is None!
return None
else:
return result
As a final note, this approach is heavily linked to eval since eval can only evaluate a single statement. If you want to extend your bot to multiple line statements, you will need to use exec, which changes the logic. Here's a great resource detailing the differences between eval and exec: What's the difference between eval, exec, and compile?
It is easy just convert try to create a new list and add the the updated values of that variable to it, for example:
if you've a variable name myVar store the values or even the questions no matter.
1- First declare a new list in your code as below:
myList = []
2- If you've need to answer or display the value through myVar then you can do like below:
myList.append(myVar)
and this if you have like a generator for the values instead if you need the opposite which means the values are already stored then you will just update the second step to be like the following:
myList[0]='The first answer of the first question'
myList[1]='The second answer of the second question'
ans here all the values will be stored in your list and you can also do this in other way, for example using loops is will be much better if you have multiple values or answers.
I want to take a file of one or more bibtex entries and output it as an html-formatted string. The specific style is not so important, but let's just say APA. Basically, I want the functionality of bibtex2html but with a Python API since I'm working in Django. A few people have asked similar questions here and here. I also found someone who provided a possible solution here.
The first issue I'm having is pretty basic, which is that I can't even get the above solutions to run. I keep getting errors similar to ModuleNotFoundError: No module named 'pybtex.database'; 'pybtex' is not a package. I definitely have pybtex installed and can make basic API calls in the shell no problem, but whenever I try to import pybtex.database.whatever or pybtex.plugin I keep getting ModuleNotFound errors. Is it maybe a python 2 vs python 3 thing? I'm using the latter.
The second issue is that I'm having trouble understanding the pybtex python API documentation. Specifically, from what I can tell it looks like the format_from_string and format_from_file calls are designed specifically for what I want to do, but I can't seem to get the syntax correct. Specifically, when I do
pybtex.format_from_file('foo.bib',style='html')
I get pybtex.plugin.PluginNotFound: plugin pybtex.style.formatting.html not found. I think I'm just not understanding how the call is supposed to work, and I can't find any examples of how to do it properly.
Here's a function I wrote for a similar use case--incorporating bibliographies into a website generated by Pelican.
from pybtex.plugin import find_plugin
from pybtex.database import parse_string
APA = find_plugin('pybtex.style.formatting', 'apa')()
HTML = find_plugin('pybtex.backends', 'html')()
def bib2html(bibliography, exclude_fields=None):
exclude_fields = exclude_fields or []
if exclude_fields:
bibliography = parse_string(bibliography.to_string('bibtex'), 'bibtex')
for entry in bibliography.entries.values():
for ef in exclude_fields:
if ef in entry.fields.__dict__['_dict']:
del entry.fields.__dict__['_dict'][ef]
formattedBib = APA.format_bibliography(bibliography)
return "<br>".join(entry.text.render(HTML) for entry in formattedBib)
Make sure you've installed the following:
pybtex==0.22.2
pybtex-apa-style==1.3
I'm looking for a way to make a function in python where you pass in a string and it returns whether it's spelled correctly. I don't want to check against a dictionary. Instead, I want it to check Google's spelling suggestions. That way, celebrity names and other various proper nouns will count as being spelled correctly.
Here's where I'm at so far. It works most of the time, but it messes up with some celebrity names. For example, things like "cee lo green" or "posner" get marked as incorrect.
import httplib
import xml.dom.minidom
data = """
<spellrequest textalreadyclipped="0" ignoredups="0" ignoredigits="1" ignoreallcaps="1">
<text> %s </text>
</spellrequest>
"""
def spellCheck(word_to_spell):
con = httplib.HTTPSConnection("www.google.com")
con.request("POST", "/tbproxy/spell?lang=en", data % word_to_spell)
response = con.getresponse()
dom = xml.dom.minidom.parseString(response.read())
dom_data = dom.getElementsByTagName('spellresult')[0]
if dom_data.childNodes:
for child_node in dom_data.childNodes:
result = child_node.firstChild.data.split()
for word in result:
if word_to_spell.upper() == word.upper():
return True;
return False;
else:
return True;
Peter Norvig tells you how implement spell checker in Python.
Rather than sticking to Mr. Google, try out other big fellows.
If you really want to stick with search engines which count page requests, Yahoo and Bing are providing some excellent features. Yahoo is directly providing spell checking services using YQL tables (Free: 5000 request/day and non-commercial).
You have good number of Python API's which are capable to do a lot similar magic including on nouns that you mentioned (sometimes may turn around - after all its somewhere based upon probability)
So, in the second case, you got a good list (totally free)
GNU - Aspell (Even got python bindings)
PyEnchant
Whoosh (It does a lot more than spell checking but I think it has some edge on it.)
I hope they should give you a clear idea of how things work.
Actually spell checking involves very complex mechanisms in the areas of Machine learning, AI, NLP.. etc a lot more. So, companies like Google/ Yahoo don't really offer their API entirely free.
I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
If Target = "" Then
ExportText = ""
Else
ExportText = Descr & Chr(44) & Assign & Chr(44) & _
Target & Chr(13) & Chr(10)
Print #fnum, ExportText
End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target)-1))
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row in Application.ActiveDocument.Tables(2).Rows
Descr = row.Cells(2).Range.Text
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=?) # not sure what to use for ?
This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right fileformat) – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.
You could use OpenOffice. It can open word files, and also can run python macros.
I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.
how about saving the file as xml. then using python or something else and pull the data out of word and into the database.
It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.