I am working on a script that needs to be able to track revisions. The general idea is to give it a list of tuples where the first entry is the name of a field (i.e. "title" or "description"), the second entry is the first version of that field, and the third entry is the revised version. So something like this:
[("Title", "The first version of the title", "The second version of the title")]
Now, using python docx I want my script to create a word file that will show the original version, and the new version with the changes bolded. Example:
Original Title:
This is the first version of the title
Revised Title:
This is the second version of the title
The way that this is done in python docx is to create a list of tuples, where the first entry is the text, and the second one is the formatting. So the way to create the revised title would be this:
paratext = [("This is the ", ''),("second",'b'),(" version of the title",'')]
Having recently discovered difflib, I figured this would be a pretty easy task. And indeed, for simple word replacements such as the sample above, it is, and can be done with the following function:
import difflib

def revFinder(str1,str2):
    s = difflib.SequenceMatcher(None, str1, str2)
    # Drop the zero-length sentinel block that always ends the list
    matches = s.get_matching_blocks()[:-1]
    paratext = []
    for i in range(len(matches)):
        print "------"
        print str1[matches[i][0]:matches[i][0]+matches[i][2]]
        print str2[matches[i][1]:matches[i][1]+matches[i][2]]
        paratext.append((str2[matches[i][1]:matches[i][1]+matches[i][2]],''))
        if i != len(matches)-1:
            print ""
            print str1[matches[i][0]+matches[i][2]:matches[i+1][0]]
            print str2[matches[i][1]+matches[i][2]:matches[i+1][1]]
            if len(str2[matches[i][1]+matches[i][2]:matches[i+1][1]]) > len(str1[matches[i][0]+matches[i][2]:matches[i+1][0]]):
                paratext.append((str2[matches[i][1]+matches[i][2]:matches[i+1][1]],'bu'))
            else:
                paratext.append((str1[matches[i][0]+matches[i][2]:matches[i+1][0]],'bu'))
    return paratext
The problems come when I want to do anything else. For example, changing 'teh' to 'the' produces t h e h (without the spaces, I couldn't figure out the formatting). Another issue is that extra text appended to the end is not shown as a change (or at all).
So, my question to all of you is: what alternatives are there to difflib that are powerful enough to handle more complicated text comparisons, or how can I use difflib better so that it works for what I want? Thanks in advance
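One possible direction, sketched here under assumptions rather than as a definitive fix: compare whole words instead of characters, and walk get_opcodes() instead of get_matching_blocks() so that text appended at the end is reported too. The name rev_finder and the choice of 'b' as the marker for changed text are illustrative only.

```python
import difflib
import re

def rev_finder(old, new):
    """Return [(text, fmt)] for `new`; fmt is 'b' where it differs from `old`."""
    # Split into words and whitespace runs so whole words are compared,
    # which avoids character-level mixes such as 'teh' -> 'the'.
    old_tokens = re.findall(r'\s+|\S+', old)
    new_tokens = re.findall(r'\s+|\S+', new)
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    paratext = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        text = ''.join(new_tokens[j1:j2])
        if not text:
            continue  # pure deletion: nothing to show in the revised text
        paratext.append((text, '' if tag == 'equal' else 'b'))
    return paratext

print(rev_finder("This is the first version of the title",
                 "This is the second version of the title"))
```

Because 'insert' opcodes are handled like 'replace', words added after the last matching block come back as a bolded tuple instead of being dropped.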
I'm trying to parse a markdown document with a regex to find if there is a title in the document (# title).
I've managed to achieve this with the regex (?m)^#{1}(?!#) (.*). The problem is that I can also have code sections in my markdown where the # title format appears as a comment.
My idea was to try to find the # title, but if in lines before there is a ```language then don't match.
Here is a text example where I need to only match # my title and not the # helloworld.py below, especially if # my title is missing (which is what I need to find out) :
<!--
.. title: Structuring a Python application
.. medium: yes
.. devto: yes
-->
# my title
In this short article I will explain all the different ways of structuring a Python application, from a quick script to a more complex web application.
## Single python file containing all the code
```python
#!/usr/bin/env python
# helloworld.py
test
This could get real messy with regex. But since it seems like you'll be using python anyway - this can be trivial.
import re

mkdwn = '''<!--
.. title: Structuring a Python application
.. medium: yes
.. devto: yes
-->
# my title
In this short article I will explain all the different ways of structuring a Python application, from a quick script to a more complex web application.
## Single python file containing all the code
```python
#!/usr/bin/env python
# helloworld.py
test'''
'''Get the first occurrence of a substring that
you're 100% certain **will not be present** before the title
but **will be present** in the document after the title (if the title exists)
'''
idx = mkdwn.index('```')
# Now, try to extract the title using regex, starting from the string start but ending at `idx`
title_match = re.search(r'^# (.+)', mkdwn[:idx],flags=re.M)
# Get the 1st group if a match was found, else empty string
title = title_match.group(1) if title_match else ''
print(title)
You can also reduce this
title_match = re.search(r'^# (.+)', mkdwn[:idx],flags=re.M)
# Get the 1st group if a match was found, else empty string
title = title_match.group(1) if title_match else ''
into a one liner, if you're into that sort of thing-
title = getattr(re.search(r'^# (.+)', mkdwn[:idx],flags=re.M), 'group', lambda _: '')(1)
getattr will return the attribute group if present (i.e. when a match was found); otherwise it'll just return that dummy function (lambda _: ''), which takes a dummy argument and returns an empty string to be assigned to title.
The returned function is then called with the argument 1, which returns the 1st group if a match was found. If a match wasn't found, well the argument doesn't matter, it just returns an empty string.
Output
my title
This is a task for three regular expressions. Screen all code fragments temporarily with the first one, process markdown with the second one, unscreen code with the third.
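A minimal sketch of that screen / process / unscreen idea, under a few assumptions: code fragments are fenced with three backticks, an unclosed fence runs to the end of the text, and the NUL-delimited marker format is an arbitrary choice. The names screen, unscreen and find_title are hypothetical.

```python
import re

FENCE = '`' * 3  # three backticks, built this way to keep the example readable

# 1: a fenced code block, up to its closing fence or the end of the text
CODE_RE = re.compile(r'^%s.*?(?:^%s[ \t]*$|\Z)' % (FENCE, FENCE), re.M | re.S)
# 2: a level-1 heading
TITLE_RE = re.compile(r'^# (.+)$', re.M)
# 3: the markers left behind by screen()
MARK_RE = re.compile(r'\x00CODE(\d+)\x00')

def screen(text):
    """Replace every code fragment with a marker; return (text, store)."""
    store = {}
    def stash(match):
        key = len(store)
        store[key] = match.group(0)
        return '\x00CODE%d\x00' % key
    return CODE_RE.sub(stash, text), store

def unscreen(text, store):
    """Put the stored code fragments back in place of their markers."""
    return MARK_RE.sub(lambda m: store[int(m.group(1))], text)

def find_title(text):
    """Return the first level-1 heading outside code, or '' if there is none."""
    screened, _ = screen(text)
    match = TITLE_RE.search(screened)
    return match.group(1) if match else ''
```

If the document is only being inspected and not rewritten, the unscreen step can be skipped, as find_title does here.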
"Screening" means storing code fragments in a dictionary and replacing with some special markdown with dictionary key.
I'm using Python 2.7, Windows 7, and Word 2003. Those three cannot change (well, except for maybe the Python version). I work in law, and the attorneys have roughly 3 boilerplate objections (each just a large piece of text, maybe 5 paragraphs) that need to be inserted into a Word document at a specific spot. Now, instead of going through and copying and pasting the objection where it's needed, my idea is for the user to go through the document adding a special word/phrase (a placeholder, if you will) that won't be found anywhere else in the document, then run some code and have Python fill in the rest. Maybe not the cleverest way to go about it, but I'm a noob. I've been practicing with a test page and inserted the text below as placeholders (the extra "o" stands for objection):
oone
otwo
othree
Below is what I have so far. I have two questions:
1) Do you have any other suggestions to go about this?
2) My code does insert the string in the correct order, but the formatting goes out the window and it writes in my string 6 times instead of 1. How can I resolve the formatting issue so it simply writes the text into the spot the placeholder is at?
import sys
import fileinput

f = open('work.doc', 'r+')
obj1 = "oone"
obj2 = "otwo"
obj3 = "othree"
for line in fileinput.input('work.doc'):
    if obj1 in line:
        f.write("Objection 1")
    elif obj2 in line:
        f.write("Objection 2")
    elif obj3 in line:
        f.write("Objection 3")
    else:
        f.write("No Objection")
f.close()
You could use python-uno to load the document into OpenOffice and manipulate it using the UNO interface. There is some example code on the site I just linked to which can get you started.
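However the document ends up being loaded, the replacement step itself can be sketched on plain text. Note that a Word 2003 .doc is a binary file, so opening it as text and writing into it (as in the question) will not preserve formatting. The function name fill_placeholders and the short objection strings below are placeholders for illustration only.

```python
import re

# Hypothetical mapping from placeholder word to the text that replaces it
OBJECTIONS = {
    'oone': 'Objection 1',
    'otwo': 'Objection 2',
    'othree': 'Objection 3',
}

def fill_placeholders(text, replacements=OBJECTIONS):
    """Replace each whole-word placeholder with its text in a single pass."""
    pattern = re.compile(
        r'\b(%s)\b' % '|'.join(re.escape(key) for key in replacements))
    return pattern.sub(lambda m: replacements[m.group(1)], text)

print(fill_placeholders("Intro\noone\nmiddle otwo end"))
```

Text without a placeholder is left untouched, so nothing like "No Objection" gets written for every other line.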
I created a word document which contains the text
Hello. You owe me ${debt}. Please pay me back soon.
in Times New Roman size 12. The file name is debtTemplate.docx. I would like to replace {debt} with an actual number (1.20) using python-docx. I tried the following code:
from docx import Document
document = Document("debtTemplate.docx")
paragraphs = document.paragraphs
debt = "1.20"
paragraph = paragraphs[0]
text = paragraph.text
newText = text.format(debt=debt)
paragraph.clear()
paragraph.add_run(newText)
document.save("debt.docx")
This results in a new document with the desired text, but in Calibri font size 11. I would like the font to be like the original: Times New Roman size 12.
I know that you can add a style variable to paragraph.add_run(), so I tried that, but nothing worked. E.g. paragraph.add_run(newText,style="Strong") didn't change anything.
Does anyone know what I can do?
EDIT: here's a modified version of my code that I had hoped would work but didn't.
from docx import Document
document = Document("debtTemplate.docx")
document.save("debt.docx")
paragraphs = document.paragraphs
debt = "1.20"
paragraph = paragraphs[0]
style = paragraph.style
text = paragraph.text
newText = text.format(debt=debt)
paragraph.clear()
paragraph.add_run(newText,style)
document.save("debt.docx")
This page in the docs should help you understand why the style is not having an effect. It's a pretty easy fix: http://python-docx.readthedocs.org/en/latest/user/styles.html
I like a couple other things about what you've found though:
Using the str.format() method to do placeholder replacement is a nice, easy way to do lightweight text replacement. I'll have to add that to the documentation as an approach to simple custom document generation.
In the XML for a paragraph, there is an optional element called <w:defRPr> which Word uses to indicate the default formatting for any new text added to the paragraph, like if you started typing after placing your insertion point at the end of the paragraph. Right now, python-docx ignores that element. That's why you're getting the default Calibri 11 instead of the Times New Roman 12 you started with. But a useful feature might be to use that element, if present, to assign run properties to any new runs added at the end of the paragraph. If you want to add that as a feature request to the GitHub tracker we'll take a look at getting it implemented.
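In the meantime, one workaround sketch is to remember the font of the paragraph's first run before clearing it and apply that to the new run. This assumes the font was set directly on that run (a value of None means it is inherited from a style, in which case there is nothing to copy). The helper name replace_text_keep_font is hypothetical; it only relies on paragraph.runs, paragraph.text, paragraph.clear(), paragraph.add_run() and run.font.

```python
def replace_text_keep_font(paragraph, **values):
    """Fill str.format placeholders in a paragraph, reusing the first run's font."""
    runs = paragraph.runs
    # Capture the explicit run-level font before clear() removes the runs
    name = runs[0].font.name if runs else None
    size = runs[0].font.size if runs else None
    new_text = paragraph.text.format(**values)
    paragraph.clear()
    run = paragraph.add_run(new_text)
    if name is not None:
        run.font.name = name
    if size is not None:
        run.font.size = size
    return run

# Usage with python-docx (not executed here):
#   from docx import Document
#   document = Document("debtTemplate.docx")
#   replace_text_keep_font(document.paragraphs[0], debt="1.20")
#   document.save("debt.docx")
```

Only the first run's font is copied, so a paragraph that mixes several fonts would come out uniform.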
Background information
I have a Python script which generates word documents with the docx module. These documents are generated based on a log and then printed and stored as records. However, the log can be edited retroactively, so the document records need to be revised, and these revisions must be tracked. I'm not actually revising the documents, but generating a new one which shows the difference between what is currently in the log, and what will soon be in the log (the log is updated after the revised file is printed). When a revision occurs, my script uses diff_match_patch to generate a mark-up of what's changed with the following function:
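import diff_match_patch as dmp_module

def revFinder(str1,str2):
    dmp = dmp_module.diff_match_patch()
    diffs = dmp.diff_main(str1,str2)
    paratext = []
    for diff in diffs:
        # diff[0] is the operation: 0 = unchanged, -1 = deleted ('s'), 1 = inserted ('b')
        paratext.append((diff[1], '' if diff[0] == 0 else ('s' if diff[0] == -1 else 'b')))
    return paratext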
docx can take text either as strings, or by tuple if word-by-word formatting is required, so [see second bullet in "Some Things to Note"]
[("Hello, ", ''), ("my name ", 'b'), ("is Brad", 's')]
produces
Hello, my name is Brad (with "my name " in bold and "is Brad" struck through)
The Problem
diff_match_patch is a very efficient library which finds the difference between two texts. Unfortunately, it's a little too efficient, so replacing redundant with dune results in
redunante
This is ugly, but it's fine for single words. However, if an entire paragraph gets replaced, the results will be entirely unreadable. That is not OK.
Previously I addressed this by collapsing all the text into a single paragraph, but this was less than ideal because it became very cluttered and was still pretty ugly.
The Solution So Far
I have a function which creates the revision document. This function gets passed a list of tuples set up like this:
[(fieldName, original, revised)]
So the document is set up as
Original fieldName (With Markup)
result of revFinder diffing original and revised
Revised fieldName
revised
I assume in order to resolve the problem, I'll need to do some sort of matching between paragraphs to make sure I don't diff two completely separate paragraphs. I'm also assuming this matching will depend if paragraphs are added or removed. Here's the code I have so far:
if len(item[1].split('\n')) + len(item[2].split('\n')) == 2:
    # Both versions are a single paragraph: diff them directly
    body.append(heading("Original {} (With Markup)".format(item[0]),2))
    body.append(paragraph(revFinder(item[1],item[2])))
    body.append(paragraph("",style="BodyTextKeep"))
    body.append(heading("Revised {}".format(item[0]),2))
    body.append(paragraph(item[2]))
    body.append(paragraph(""))
else:
    # Positive: paragraphs were removed; negative: paragraphs were added
    diff = len(item[1].split('\n')) - len(item[2].split('\n'))
    if diff == 0:
        body.append(heading("Original {} (With Markup)".format(item[0]),2))
        for orPara, revPara in zip(item[1].split('\n'),item[2].split('\n')):
            body.append(paragraph(revFinder(orPara,revPara)))
        body.append(paragraph("",style="BodyTextKeep"))
        body.append(heading("Revised {}".format(item[0]),2))
        for para in item[2].split('\n'):
            body.append(paragraph("{}".format(para)))
        body.append(paragraph(""))
    elif diff > 0:
        #Removed paragraphs (not handled yet)
        pass
    elif diff < 0:
        #Added paragraphs (not handled yet)
        pass
So far I've planned on using something like difflib to do paragraph matching. But if there's a better way to avoid this problem that is a completely different approach, that's great too.
Some Things to Note:
I'm running Python 2.7.6 32-bit on Windows 7 64-bit
I've made some changes to my local copy of docx (namely adding the strike through formatting) so if you test this code you will not be able to replicate what I'm doing in that regard
Description of the Entire Process (with the revision steps in bold):
1) User opens Python script and uses GUI to add information to a thing called a "Condition Report" (CR)
NOTE: A full CR contains 4 parts, all completed by different people. But each part gets individually printed. All 4 parts are stored together in the log
2) When the user is finished, the information is saved to a log (described below), and then printed as a .docx file
3) The printed document is signed and stored
4) When the user wants to revise a part of the CR, they open the GUI and edit the information in each of the fields. I am only concerned about a few of the fields in this question, and those are the multiline text controls (which can result in multiple paragraphs)
5) Once the user is done with the revision, the code generates the tuple list I described in the "Solution So Far" section, and sends this to the function which generates the revision document
6) The revision document is created, printed, signed, and stored with the original document for that part of that CR
7) The log is completely rewritten to include the revised information
The Log:
The log is simply a giant dict which stores all the information on all of the CRs. The general format is
{"Unique ID Number": [list of CR info]}
The log doesn't store past versions of a CR, so when a CR is revised the old information is overwritten (which is what we want for the system). As I mentioned earlier, every time the log is edited, the whole thing is rewritten. To get at the information in the log, I import it (since it always lives in the same directory as the script)
Try using the post-diff cleanup options of diff_match_patch that @tzaman mentioned above. In particular, check out the diff_cleanupSemantic function, which is intended for cases where the diff output needs to be human-readable.
Cleanup options are NOT run automatically, since diff_match_patch provides several cleanup options from which you may choose (depending on your needs).
Here is an example:
import diff_match_patch
dmp = diff_match_patch.diff_match_patch()
diffs = dmp.diff_main('This is my original paragraph.', 'My paragraph is much better now.')
print diffs # pre-cleanup
dmp.diff_cleanupSemantic(diffs)
print diffs # post cleanup
Output:
[(-1, 'This is m'), (1, 'M'), (0, 'y'), (-1, ' original'), (0, ' paragraph'), (1, ' is much better now'), (0, '.')]
[(-1, 'This is my original paragraph'), (1, 'My paragraph is much better now'), (0, '.')]
As you can see, the first diff is optimal but unreadable, while the second diff (after cleanup) is exactly what you are looking for.
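For the paragraph-matching part of the question, one possible sketch uses difflib from the standard library to line up paragraphs before each pair is diffed. The name pair_paragraphs is hypothetical, and pairing replaced paragraphs in order is a simplifying assumption that may not suit every revision.

```python
import difflib

def pair_paragraphs(original, revised):
    """Return (old_para, new_para) pairs; None marks an added or removed paragraph."""
    old = original.split('\n')
    new = revised.split('\n')
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed, added = old[i1:i2], new[j1:j2]
        # Pad the shorter side so that extra paragraphs pair with None
        longest = max(len(removed), len(added))
        removed += [None] * (longest - len(removed))
        added += [None] * (longest - len(added))
        pairs.extend(zip(removed, added))
    return pairs

for pair in pair_paragraphs("Intro\nBody one\nEnd",
                            "Intro\nBody one changed\nExtra\nEnd"):
    print(pair)
```

Pairs where both entries are present can then go through revFinder followed by diff_cleanupSemantic, while a None on either side marks a paragraph to show as wholly added or wholly removed.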
Consider using git to manage all these revisions; see GitPython for a Python API. Also see "Git (or Hg) plugin for dealing with Microsoft Word and/or OpenOffice files" for how to xmldiff and have one element per line.
Found this great answer on how to check if a list of strings are within a line
How to check if a line has one of the strings in a list?
But trying to do a similar thing with keys in a dict does not seem to do the job for me:
import urllib2

url_info = urllib2.urlopen('http://rss.timegenie.com/forex.xml')
currencies = {"DKK": [], "SEK": []}
print currencies.keys()
testCounter = 0
for line in url_info:
    if any(countryCode in line for countryCode in currencies.keys()):
        testCounter += 1
    if "DKK" in line or "SEK" in line:
        print line
print "testCounter is %i and should be 2 - if not debug the code" % (testCounter)
The output:
['SEK', 'DKK']
<code>DKK</code>
<code>SEK</code>
testCounter is 377 and should be 2 - if not debug the code
Think that perhaps my problem is because that .keys() gives me an array rather than a list.. But haven't figured out how to convert it..
change:
any(countryCode in line for countryCode in currencies.keys())
to:
any([countryCode in line for countryCode in currencies.keys()])
Your original code uses a generator expression whereas (I think) your intention is a list comprehension.
see: Generator Expressions vs. List Comprehension
UPDATE:
I found that using an ipython interpreter with pylab imported I got the same results as you did (377 counts versus the anticipated 2). I realized the issue was that 'any' was from the numpy package which is meant to work on an array.
Next, I loaded an ipython interpreter without pylab such that 'any' was from builtin. In this case your original code works.
So if you're using an ipython interpreter, type:
help(any)
and make sure it is from the builtin module. If so your original code should work fine.
This is not a very good way to examine an xml file.
It's slow. You are making potentially N*M substring searches where N is the number of lines and M is the number of keys.
XML is not a line-oriented text format. Your substring searches could find attribute names or element names too, which is probably not what you want. And if the XML file happens to put all its elements on one line with no whitespace (common for machine-generated and -processed XML) you will get fewer matches than you expect.
If you have line-oriented text input, I suggest you construct a regex from your list of keys:
import re

# One alternation that matches any of the keys
linetester = re.compile('|'.join(re.escape(key) for key in currencies))
for match in linetester.finditer(entire_text):
    print match.group(0)
#or if entire_text is too long and you want to consume iteratively:
for line in entire_text:
    for match in linetester.finditer(line):
        print match.group(0)
However, since you have XML, you should use an actual XML processor:
import xml.etree.cElementTree as ET

# Assumes url_info is the open feed from the question
forex = ET.parse(url_info)
for elem in forex.findall('data/code'):
    if elem.text in currencies:
        print elem.text
If you are only interested in what codes are present and don't care about the particular entry you can use set intersection:
codes = frozenset(e.text for e in forex.findall('data/code'))
print codes & frozenset(currencies)
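A self-contained sketch of the set-intersection approach, using a tiny inline document in place of the real feed. The layout of SAMPLE (a root element containing data/code elements) is assumed from the element path above, not taken from the actual forex.xml, and codes_present is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the feed; only the data/code nesting matters here
SAMPLE = """<forex>
  <data><code>DKK</code></data>
  <data><code>EUR</code></data>
  <data><code>SEK</code></data>
</forex>"""

def codes_present(xml_text, wanted):
    """Return the wanted currency codes that appear in data/code elements."""
    root = ET.fromstring(xml_text)
    found = frozenset(elem.text for elem in root.findall('data/code'))
    return found & frozenset(wanted)

currencies = {"DKK": [], "SEK": []}
print(sorted(codes_present(SAMPLE, currencies)))  # prints ['DKK', 'SEK']
```

Because only the text of code elements is compared, a currency code that happens to appear inside some other element or attribute is not counted.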