Paragraph Matching Python

Paragraph Matching Python - python

Background information
I have a Python script which generates word documents with the docx module. These documents are generated based on a log and then printed and stored as records. However, the log can be edited retroactively, so the document records need to be revised, and these revisions must be tracked. I'm not actually revising the documents, but generating a new one which shows the difference between what is currently in the log, and what will soon be in the log (the log is updated after the revised file is printed). When a revision occurs, my script uses diff_match_patch to generate a mark-up of what's changed with the following function:
def revFinder(str1,str2):
dmp = dmp_module.diff_match_patch()
diffs = dmp.diff_main(str1,str2)
paratext = []
for diff in diffs:
paratext.append((diff[1], '' if diff[0] == 0 else ('s' if diff[0] == -1 else 'b')))
return paratext
docx can take text either as strings, or by tuple if word-by-word formatting is required, so [see second bullet in "Some Things to Note"]
[("Hello, ", ''), ("my name ", 'b'), ("is Brad", 's')]
produces
Hello my name is Brad
The Problem
diff_match_patch is a very efficient code which finds the difference between two texts. Unfortuanly, its a little too efficient, so replacing redundant with dune results in
redunante
This is ugly, but its fine for single words. However, if an entire paragraph gets replaced, the results will be entirely unreadable. That is not ok.
Previously I addressed this by collapsing all the text into a single paragraph, but this was less than ideal because it became very cluttered and was still pretty ugly.
The Solution So Far
I have a function which creates the revision document. This function gets passed a list of tuples set up like this:
[(fieldName, original, revised)]
So the document is set up as
Orignial fieldName (With Markup)
result of revFinder diffing orignal and revised
Revised fieldName
revised
I assume in order to resolve the problem, I'll need to do some sort of matching between paragraphs to make sure I don't diff two completely separate paragraphs. I'm also assuming this matching will depend if paragraphs are added or removed. Here's the code I have so far:
if len(item[1].split('\n')) + len(item[1].split('\n'))) == 2:
body.append(heading("Original {} (With Markup)".format(item[0]),2))
body.append(paragraph(revFinder(item[1],item[2])))
body.append(paragraph("",style="BodyTextKeep"))
body.append(heading("Revised {}".format(item[0]),2))
body.append(paragraph(item[2]))
body.append(paragraph(""))
else:
diff = len(item[1].split('\n')) - len(item[1].split('\n'))
if diff == 0:
body.append(heading("Original {} (With Markup)".format(item[0]),2))
for orPara, revPara in zip(item[1].split('\n'),item[2].split('\n')):
body.append(paragraph(revFinder(orPara,revPara)))
body.append(paragraph("",style="BodyTextKeep"))
body.append(heading("Revised {}".format(item[0]),2))
for para in item[2].split('\n'):
body.append(paragraph("{}".format(para)))
body.append(paragraph(""))
elif diff > 0:
#Removed paragraphs
elif diff < 0:
#Added paragraphs
So far I've planned on using something like difflib to do paragraph matching. But if there's a better way to avoid this problem that is a completely different approach, that's great too.
Some Things to Note:
I'm running Python 2.7.6 32-bit on Windows 7 64-bit
I've made some changes to my local copy of docx (namely adding the strike through formatting) so if you test this code you will not be able to replicate what I'm doing in that regard
Description of the Entire Process (with the revision steps in bold):
1) User opens Python script and uses GUI to add information to a thing called a "Condition Report" (CR)
NOTE: A full CR contains 4 parts, all completed by different people. But each part gets individually printed. All 4 parts are
stored together in the log
2) When the user is finished, the information is saved to a log (described below), and then printed as a .docx file
3) The printed document is signed and stored
4) When the user wants to revise a part of the CR, the open the GUI, and edit the information in each of the fields. I am only concerned about a few of the fields in this question, and those are the multiline text controls (which can result in multiple paragraphs)
5) Once the user is done with the revision, the code generates the tuple list I described in the "Solution So Far" section, and sends this to the function which generates the revision document
6) The revision document is created, printed, signed, and stored with the original document for that part of that CR
7) The log is completely rewritten to include the revised information
The Log:
The log is simply a giant dict which stores all the information on all of the CRs. The general format is
{"Unique ID Number": [list of CR info]}
The log doesn't store past versions of a CR, so when a CR is revised the old information is overwritten (which is what we want for the system). As I mentioned earlier, every time the log is edited, the whole thing is rewritten. To get at the information in the log, I import it (since it always lives in the same directory as the script)

Try using the post-diff cleanup options that diff_match_patch that #tzaman mentioned above, in particular, check out the diff_cleanupSemantic function which is intended for use when the diff output is intended to be human-readable.
Cleanup options are NOT run automatically, since diff_match_patch provides several cleanup options from which you may choose (depending on your needs).
Here is an example:
import diff_match_patch
dmp = diff_match_patch.diff_match_patch()
diffs = dmp.diff_main('This is my original paragraph.', 'My paragraph is much better now.')
print diffs # pre-cleanup
dmp.diff_cleanupSemantic(diffs)
print diffs # post cleanup
Output:
[(-1, 'This is m'), (1, 'M'), (0, 'y'), (-1, ' original'), (0, ' paragraph'), (1, ' is much better now'), (0, '.')]
[(-1, 'This is my original paragraph'), (1, 'My paragraph is much better now'), (0, '.')]
As you can see, the first diff is optimal but unreadable, while the second dif (after cleanup) is exactly what you are looking for.

consider using git to manage all these revisions, see GitPython for a python api, also see
Git (or Hg) plugin for dealing with Microsoft Word and/or OpenOffice files for how to xmldiff and have one element per line

Related

Is there a way to programmatically reject changes to a word document using python, while not deleting comments from it?

I have old version of a few word documents (word document with '.doc' extension) all of which have a lot of tracked changes in them. Most of the changes have comments associated with them.
I need to figure out a way to use python to reject all the changes that have been made in the documents, while retaining the comments.
I tried this with the new versions of word document('.docx' files) and faced no issues. All the changes were rejected and the word document still had all the comments in it. But when I tried to do it with the older versions of word document, all my comments got deleted.
I was using the following function at first with few different versions of the word file.
def reject_changes(path):
doc = word.Documents.Open(path)
doc.Activate()
word.ActiveDocument.TrackRevisions = False
word.ActiveDocument.Revisions.RejectAll()
word.ActiveDocument.Save()
doc.Close(False)
I tried to use the above function with the original word document
I changed the extension of the file to '.docx' and tried the above function
I made a copy of the document and saved it in '.docx' format.
In all these cases the comments were deleted.
I then tried the following code:
def reject_changes(path):
doc = word.Documents.Open(path)
doc.Activate()
word.ActiveDocument.TrackRevisions = False
nextRev = word.Selection.NextRevision()
while nextRev:
nextRev.Reject()
nextRev = word.Selection.NextRevision()
word.ActiveDocument.Save()
doc.Close(False)
For some reason this code was almost working. But on checking few of the documents again, I found that while most of the comments remained a couple of them were still deleted.
I think that since the comments are being deleted, they are probably a part of Revisions, in that case, is it possible to check if the revision is a comment or not. If not, can someone please suggest a way to ensure that no comments are deleted in the document on rejecting the changes.
Edit:
So, I found out that the comments that were getting deleted were added to the document when the 'Track Changes' option was active. I guess it made the comments as a part of the revision. So my first function works pretty well in case the comments are made once the 'Track Changes' option was not active.
But then, I have about more then twenty word documents (all of them a mix of doc and docx files), each of them have at least fifteen pages and over fifty comments.
I am using win32com.client. I am not too familiar with other packages that work with MS word. Any help would be appreciated.
Thanks!

Okay, so I was able to get a workaround for this by:
Creating a selection object and selecting the scope of the text marked by the comment.
Saving the range of the commented text into a range object.
Rejecting the tracked changes for the selected text.
Getting the new text based on the range object that was created in step 2.
This method takes a lot of time, though and the easiest way to extract the marked text is to ensure that comments are made when the word is not tracking the changes.
This is the code I am using now.
def reject_changes(path, doc_names):
word = win32.gencache.EnsureDispatch('Word.Application')
rejected_changes = []
for doc in doc_names:
#open the word document
wb = word.Documents.Open(rejected_doc)
wb.Activate()
current_doc = word.ActiveDocument
current_doc.TrackRevisions = False
text = ''
#iterating over the comments
for c in current_doc.Comments:
sentence_range = c.Scope #returns a range object of the text marked by comment
select_sentence = sentence_range.Select() #select the sentence marked by sentence_range
nextRev = word.Selection.NextRevision() #checks for the next revision in word
while nextRev:
#if the next revision is not within the sentence_range then skip.
if nextRev.Range.Start < sentence_range.Start or nextRev.Range.End > sentence_range.End:
break
else:
nextRev.Reject()
new_range = current_doc.Range(sentence_range.Start, sentence_range.End)
text = new_range.Text
nextRev = word.Selection.NextRevision()
author = c.Author
rejected_changes.append((doc,author,text,path))
current_doc.Save()
wb.Close(False)
return rejected_changes

How to detect an empty paragraph in python-docx

Given a document containing a paragraph
d = docx.Document()
p = d.add_paragraph()
I expected the following technique to work every time:
if len(p._element) == 0:
# p is empty
OR
if len(p._p) == 0:
# p is empty
(Side question, what's the difference there? It seems that p._p is p._element in every case I've seen in the wild.)
If I add a style to my paragraph, the check no longer works:
>>> p2 = d.add_paragraph(style="Normal")
>>> print(len(p2._element))
1
Explicitly setting text=None doesn't help either, not that I would expect it to.
So how to I check if a paragraph is empty of content (specifically text and images, although more generic is better)?
Update
I messed around a little and found that setting the style apparently adds a single pPr element:
>>> p2._element.getchildren()
[<CT_PPr '<w:pPr>' at 0x7fc9a2b64548>]
The element itself it empty:
>>> len(p2._element.getchildren()[0])
0
But more importantly, it is not a run.
So my test now looks like this:
def isempty(par):
return sum(len(run) for run in par._element.xpath('w:r')) == 0
I don't know enough about the underlying system to have any idea if this is a reasonable solution or not, and what the caveats are.
More Update
Seems like I need to be able to handle a few different cases here:
def isempty(par):
p = par._p
runs = p.xpath('./w:r[./*[not(self::w:rPr)]]')
others = p.xpath('./*[not(self::w:pPr) and not(self::w:r)] and '
'not(contains(local-name(), "bookmark"))')
return len(runs) + len(others) == 0
This skips all w:pPr elements and runs with nothing but w:rPr elements. Any other element, except bookmarks, whether in the paragraph directly or in a run, will make the result non-empty.

The <w:p> element can have any of a large number of children, as you can see from the XML Schema excerpt here: http://python-docx.readthedocs.io/en/latest/dev/analysis/schema/ct_p.html (see the CT_P and EG_PContent definitions).
In particular, it often has a w:pPr child, which is where the style setting goes.
So your test isn't very reliable against false positives (if being empty is considered positive).
I'd be inclined to use paragraph.text == '', which parses through the runs.
A run can be empty (of text), so the mere presence of a run is not proof enough. The actual text is held in a a:t (text) element, which can also be empty. So the .text approach avoids all those low-level complications for you and has the benefit of being part of the API so much, much less likely to change in a future release.

Python: Writing text to a 2003 Word Doc in a specific place on the page

I'm using Python 2.7, Windows 7, and Word 2003. Those three cannot change (well except for maybe the python version). I work in Law and the attorneys have roughly 3 boiler plate objections (just a large piece of text, maybe 5 paragraphs) that need to be inserted into a word document at a specific spot. Now instead of going through and copying and pasting the objection where its needed, my idea is for the user to go through the document adding a special word/phrase (place holder if you will) that wont be found anywhere in the document. Then run some code and have python fill in the rest. Maybe not the cleverest way to go about it, but I'm a noob. I've been practicing with a test page and inserted the below text as place holders (the extra "o" stands for objection)
oone
otwo
othree
Below is what I have so far. I have two questions
Do you have any other suggestions to go about this?
My code does insert the string in the correct order, but the formatting goes out the window and it writes in my string 6 times instead of 1. How can I resolve the formatting issue so it simply writes the text into the spot the place holder is at?
import sys
import fileinput
f = open('work.doc', 'r+')
obj1 = "oone"
obj2 = "otwo"
obj3 = "othree"
for line in fileinput.input('work.doc'):
if obj1 in line:
f.write("Objection 1")
elif obj2 in line:
f.write("Objection 2")
elif obj3 in line:
f.write("Objection 3")
else:
f.write("No Objection")
f.close

You could use python-uno to load the document into OpenOffice and manipulate it using the UNO interface. There is some example code on the site I just linked to which can get you started.

What stronger alternatives are there to difflib?

I am working on script that needs to be able to track revisions. The general idea is to give it a list of tuples where the first entry is the name of a field (ie "title" or "description" etc.), the second entry is the first version of that field, and the third entry is the revised version. So something like this:
[("Title", "The first version of the title", "The second version of the title")]
Now, using python docx I want my script to create a word file that will show the original version, and the new version with the changes bolded. Example:
Original Title:
This is the first version of the title
Revised Title:
This is the second version of the title
The way that this is done in python docx is to create a list of tuples, where the first entry is the text, and the second one is the formatting. So the way to create the revised title would be this:
paratext = [("This is the ", ''),("second",'b'),(" version of the title",'')]
Having recent discovered difflib I figured this would be a pretty easy task. And indeed, for simple word replacements such as sample above, it is, and can be done with the following function:
def revFinder(str1,str2):
s = difflib.SequenceMatcher(None, str1, str2)
matches = s.get_matching_blocks()[:-1]
paratext = []
for i in range(len(matches)):
print "------"
print str1[matches[i][0]:matches[i][0]+matches[i][2]]
print str2[matches[i][1]:matches[i][1]+matches[i][2]]
paratext.append((str2[matches[i][1]:matches[i][1]+matches[i][2]],''))
if i != len(matches)-1:
print ""
print str1[matches[i][0]+matches[i][2]:matches[i+1][0]]
print str2[matches[i][1]+matches[i][2]:matches[i+1][1]]
if len(str2[matches[i][1]+matches[i][2]:matches[i+1][1]]) > len(str1[matches[i][0]+matches[i][2]:matches[i+1][0]]):
paratext.append((str2[matches[i][1]+matches[i][2]:matches[i+1][1]],'bu'))
else:
paratext.append((str1[matches[i][0]+matches[i][2]:matches[i+1][0]],'bu'))
return paratext
The problems come when I want to do anything else. For example, changing 'teh' to 'the' produces t h e h (without the spaces, I couldn't figure out the formatting). Another issue is that extra text appended to the end is not shown as a change (or at all).
So, my question to all of you is what alternatives are there to difflib which are powerful enough to handle more complicated text comparions, or, how can I use difflib better such that it works for what I want? Thanks in advance

Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
If Target = "" Then
ExportText = ""
Else
ExportText = Descr & Chr(44) & Assign & Chr(44) & _
Target & Chr(13) & Chr(10)
Print #fnum, ExportText
End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?

Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target)-1))
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row in Application.ActiveDocument.Tables(2).Rows
Descr = row.Cells(2).Range.Text

Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=?) # not sure what to use for ?
This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right fileformat) – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.

You could use OpenOffice. It can open word files, and also can run python macros.

I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.

how about saving the file as xml. then using python or something else and pull the data out of word and into the database.

It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Paragraph Matching Python - python

consider using git to manage all these revisions, see GitPython for a python api, also see Git (or Hg) plugin for dealing with Microsoft Word and/or OpenOffice files for how to xmldiff and have one element per line

Related

Is there a way to programmatically reject changes to a word document using python, while not deleting comments from it?

How to detect an empty paragraph in python-docx

Python: Writing text to a 2003 Word Doc in a specific place on the page

What stronger alternatives are there to difflib?

Extracting data from MS Word

Categories

Resources