Python-Docx changing existing text creates weird problems

So I am trying to automate some work tasks to help me create reports for various assignments. I have a template report that I want to simply replace the placeholder text. My code works for the most part, but the method I am using comes up with some strange results. Here are the relevant sections of my current code:
def create_new_report(self):
    report = Document('Template.docx')
    # Change Headers First
    for sec in report.sections:
        head = sec.header
        for para in head.paragraphs:
            for run in para.runs:
                self.replace_run_text(run)
    # Then the Tables
    for table in report.tables:
        for row in table.rows:
            for cell in row.cells:
                for para in cell.paragraphs:
                    for run in para.runs:
                        self.replace_run_text(run)
    # And finally the Body
    for para in report.paragraphs:
        for run in para.runs:
            self.replace_run_text(run)

def replace_run_text(self, run):
    # Takes the run, performs string.replace for args, and returns new run
    text = run.text
    for arg in self.args:  # a list of keys and the text to replace them with
        text = text.replace(arg[0], arg[1])
    run.text = text
For the most part this works well. However, when running this, I have noticed that it has some weird consequences. For the header, I had to hard-code which specific paragraphs to work with because running this on the entire thing was deleting my company logo as an image.
In the body, this code will remove page breaks, or form text boxes. I break up everything to individual runs in order to retain all styling, and that seems to work well at least.
For now I have hard-coded around the idiosyncrasies that come up, but I want to be able to make changes to my template document and have it just work, rather than needing to change those hard-coded sections as well. Does anyone have any advice as to why this particular behavior is occurring?
It really doesn't make sense to me. Why is the page break or the logo being removed when they do not even contain any runs? Or at the very least, I can guarantee they do not contain any of the text keys that are being replaced. They shouldn't be being messed with at all. But they are. I would appreciate any insight that anyone has!
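One likely explanation, offered as a sketch rather than a confirmed diagnosis: in python-docx, inline pictures and page breaks are stored inside runs, and assigning to run.text replaces the whole content of the run, not only its text. The code above assigns run.text for every run, even when no key matched, so a run that holds only a logo or a page break can be emptied. The variant below writes back only when a replacement actually changed the text. The standalone function and its `replacements` parameter are hypothetical names modeled on the question's code, and the function accepts any object with a `.text` attribute.

```python
def replace_run_text(run, replacements):
    """Replace placeholders in a run, writing back only if the text changed.

    `run` is any object with a readable and writable `.text` attribute,
    such as a python-docx Run. `replacements` is a list of (key, value)
    pairs. Returns True if the run was modified.
    """
    original = run.text
    updated = original
    for key, value in replacements:
        updated = updated.replace(key, value)
    if updated == original:
        # Nothing matched: leave the run, and anything stored in it, untouched.
        return False
    run.text = updated
    return True
```

A run that mixes a placeholder with a picture or a break would still lose the non-text content when it is rewritten, so placeholders are safest in runs of their own.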

Related

Cannot access first page header (python-docx)

I recently wrote Python code to replace a word in an MS Office document via python-docx. The code worked well for a couple of weeks. I could reach the header on every page with the code below:
sections = doc.sections
for z in range(0, len(sections)):
    header_section = doc.sections[z]
    header = header_section.header
    header_text = header.paragraphs[0]
However, a few days ago I found an issue. The code does not seem to work on the first page of the document (but still works on later pages). I have tried to figure out why it stopped working, and it seems to be related to the document's "different first page" setting (I created a new document without a different first page and it worked just fine). Can anyone suggest what caused my code to stop working, so I can find a way to rewrite it? (Could an MS Office update have affected it?) And if you have any idea how to access the header in a document with the different-first-page setting, please share. Thanks in advance.

A section can have up to three headers (and three footers). These are the first-page header, even-page header, and default header. Most sections have only the default header, but if the first-page header is defined, it is used for the first page only. This is because the first page of a chapter, say, would typically not have a header or may have a different header than the "running" header through the rest of the chapter or other section.
So look to see whether you have a first-page header or even-page header defined; that could explain the behavior you're seeing. See these pages in the documentation for more:
https://python-docx.readthedocs.io/en/latest/api/section.html#id1
https://python-docx.readthedocs.io/en/latest/user/hdrftr.html
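As a rough sketch of how the three header variants could be visited for editing: the attribute names header, first_page_header, even_page_header, and is_linked_to_previous are the ones python-docx documents on the pages above, while the function name iter_header_paragraphs is hypothetical. Headers that only inherit from the previous section are skipped, on the assumption that they were already visited there.

```python
def iter_header_paragraphs(section):
    """Yield paragraphs from each header variant that the section defines itself."""
    headers = (
        section.header,             # default ("running") header
        section.first_page_header,  # used when "different first page" is on
        section.even_page_header,   # used when odd/even headers differ
    )
    for header in headers:
        if header.is_linked_to_previous:
            # No definition of its own; it inherits from the prior section.
            continue
        yield from header.paragraphs
```

Calling it once per entry in doc.sections would then cover first-page headers as well as the default ones.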

This mystified me for a while, thanks for the tip to read the docs more closely. For those looking for a copy/paste, here's what I did:
from docx import Document

def get_headers_and_footers(doc: Document):
    header_txt = set()
    for section in doc.sections:
        for paragraph in section.header.paragraphs:
            header_txt.add(paragraph.text)
        for paragraph in section.first_page_header.paragraphs:
            header_txt.add(paragraph.text)
        for paragraph in section.even_page_header.paragraphs:
            header_txt.add(paragraph.text)
    footer_txt = set()
    for section in doc.sections:
        # Read the footer objects here, not the headers a second time
        for paragraph in section.footer.paragraphs:
            footer_txt.add(paragraph.text)
        for paragraph in section.first_page_footer.paragraphs:
            footer_txt.add(paragraph.text)
        for paragraph in section.even_page_footer.paragraphs:
            footer_txt.add(paragraph.text)
    return list(header_txt), list(footer_txt)

Is there a way to programmatically reject changes to a word document using python, while not deleting comments from it?

I have old versions of a few Word documents (with the '.doc' extension), all of which contain a lot of tracked changes. Most of the changes have comments associated with them.
I need to figure out a way to use Python to reject all the changes that have been made in the documents, while retaining the comments.
I tried this with the newer Word format ('.docx' files) and faced no issues. All the changes were rejected and the document still had all the comments in it. But when I tried it with the older format, all my comments got deleted.
At first I used the following function with a few different versions of the Word file.
def reject_changes(path):
    doc = word.Documents.Open(path)
    doc.Activate()
    word.ActiveDocument.TrackRevisions = False
    word.ActiveDocument.Revisions.RejectAll()
    word.ActiveDocument.Save()
    doc.Close(False)
I tried to use the above function with the original Word document.
I changed the extension of the file to '.docx' and tried the above function.
I made a copy of the document and saved it in '.docx' format.
In all these cases the comments were deleted.
I then tried the following code:
def reject_changes(path):
    doc = word.Documents.Open(path)
    doc.Activate()
    word.ActiveDocument.TrackRevisions = False
    nextRev = word.Selection.NextRevision()
    while nextRev:
        nextRev.Reject()
        nextRev = word.Selection.NextRevision()
    word.ActiveDocument.Save()
    doc.Close(False)
For some reason this code almost worked. But on checking a few of the documents again, I found that while most of the comments remained, a couple of them were still deleted.
I think that since the comments are being deleted, they are probably part of Revisions. In that case, is it possible to check whether a revision is a comment or not? If not, can someone please suggest a way to ensure that no comments are deleted from the document when rejecting the changes?
Edit:
So, I found out that the comments that were getting deleted had been added to the document while the 'Track Changes' option was active. I guess that made the comments part of the revision. So my first function works pretty well as long as the comments were made while 'Track Changes' was not active.
But I have more than twenty Word documents (a mix of doc and docx files), each with at least fifteen pages and over fifty comments.
I am using win32com.client. I am not too familiar with other packages that work with MS Word. Any help would be appreciated.
Thanks!

Okay, so I was able to get a workaround for this by:
1. Creating a selection object and selecting the scope of the text marked by the comment.
2. Saving the range of the commented text into a range object.
3. Rejecting the tracked changes for the selected text.
4. Getting the new text based on the range object that was created in step 2.
This method takes a lot of time, though, and the easiest way to extract the marked text is to ensure that comments are made while Word is not tracking changes.
This is the code I am using now.
import os
import win32com.client as win32

def reject_changes(path, doc_names):
    word = win32.gencache.EnsureDispatch('Word.Application')
    rejected_changes = []
    for doc in doc_names:
        # open the word document
        # (assumption: `path` is the folder and `doc` the file name; the
        # original passed an undefined name, `rejected_doc`)
        wb = word.Documents.Open(os.path.join(path, doc))
        wb.Activate()
        current_doc = word.ActiveDocument
        current_doc.TrackRevisions = False
        text = ''
        # iterating over the comments
        for c in current_doc.Comments:
            sentence_range = c.Scope  # returns a range object of the text marked by comment
            select_sentence = sentence_range.Select()  # select the sentence marked by sentence_range
            nextRev = word.Selection.NextRevision()  # checks for the next revision in word
            while nextRev:
                # if the next revision is not within the sentence_range then skip.
                if nextRev.Range.Start < sentence_range.Start or nextRev.Range.End > sentence_range.End:
                    break
                else:
                    nextRev.Reject()
                    new_range = current_doc.Range(sentence_range.Start, sentence_range.End)
                    text = new_range.Text
                    nextRev = word.Selection.NextRevision()
            author = c.Author
            rejected_changes.append((doc, author, text, path))
        current_doc.Save()
        wb.Close(False)
    return rejected_changes

python-docx - deleting first paragraph

When I create a new document with python-docx and add paragraphs, it starts at the very first line. But if I use an empty document (I need it because of the user-defined styles) and add paragraphs, the document always starts with an empty line. Is there any workaround?

You can call document._body.clear_content() before adding the first paragraph.
document = Document('my-document.docx')
document._body.clear_content()
# start adding new paragraphs and whatever ...
That will leave the document with no paragraphs, so when you add new ones they start at the beginning.
It does, however, leave the document in a technically invalid state. So if you didn't add any new paragraphs and then tried to open it with Word, you might get a repair error on loading.
But if the next thing you're doing is adding paragraphs of your own, this should work just fine.
Also, note that this is technically an "internal" method and is not part of the documented API. So there's no guarantee this method's name won't change in a future release. But frankly I can't see any reason to change or remove it, so I expect it's safe enough :)

python lxml xpath AttributeError (NoneType) with correct xpath and usually working

I am trying to migrate a forum to phpbb3 with python/xpath. Although I am pretty new to python and xpath, it is going well. However, I need help with an error.
(The source file has been downloaded and processed with tagsoup.)
Firefox/Firebug show xpath: /html/body/table[5]/tbody/tr[position()>1]/td/a[3]/b
(in my script without tbody)
Here is an abbreviated version of my code:
forumfile="morethread-alte-korken-fruchtweinkeller-89069-6046822-0.html"
XPOSTS = "/html/body/table[5]/tr[position()>1]"
t = etree.parse(forumfile)
allposts = t.xpath(XPOSTS)
XUSER = "td[1]/a[3]/b"
XREG = "td/span"
XTIME = "td[2]/table/tr/td[1]/span"
XTEXT = "td[2]/p"
XSIG = "td[2]/i"
XAVAT = "td/img[last()]"
XPOSTITEL = "/html/body/table[3]/tr/td/table/tr/td/div/h3"
XSUBF = "/html/body/table[3]/tr/td/table/tr/td/div/strong[position()=1]"
for p in allposts:
    unreg=0
    username = None
    username = p.find(XUSER).text #this is where it goes haywire
When the loop hits user "tompson" / position()=11 at the end of the file, I get
AttributeError: 'NoneType' object has no attribute 'text'
I've tried a lot of try except else finallys, but they weren't helpful.
I am getting much more information later in the script such as date of post, date of user registry, the url and attributes of the avatar, the content of the post...
The script works for hundreds of other files/sites of this forum.
This is no encode/decode problem. And it is not "limited" to the XUSER part. I tried to "hardcode" the username, then the date of registry will fail. If I skip those, the text of the post (code see below) will fail...
#text of getpost
text = etree.tostring(p.find(XTEXT),pretty_print=True)
Now, this whole error would make sense if my xpath were wrong. However, all the other files and the first few users in this file work. It is only this "one" at position()=11.
Is position() incapable of going >10? I don't think so?
Am I missing something?

Question answered!
I have found the answer...
I must have been very tired when I tried to fix it and came here to ask for help. I did not see something quite obvious...
The way I posted my problem, it was not visible either.
The HTML I downloaded and processed with tagsoup had an additional tag at position 11. This was not visible on the website and screwed with my xpath.
(It is probably crappy HTML generated by the forum in combination with tagsoup's attempt to make it parseable.)
Out of >20000 files, fewer than 20 are afflicted; this one here just happened to be the first.
Additionally, sometimes the information is in table[4], other times in table[5]. I did account for this and wrote a function that determines the correct table. Although I tested the function a LOT and thought it was working correctly (hence did not include it above), it was not.
So I made a better xpath:
'/html/body/table[tr/td[@width="20%"]]/tr[position()>1]'
And, although this is not related, I ran into another problem with unexpected encoding in the html file (not utf-8), which was fixed by adding:
parser = etree.XMLParser(encoding='ISO-8859-15')
t = etree.parse(forumfile, parser)
I am now confident that after adjusting for strange additional and multiple , and tags my code will work on all files...
Still I will be looking into lxml.html, as I mentioned in the comment, I have never used it before, but if it is more robust and may allow for using the files without tagsoup, it might be a better fit and save me extensive try/except statements and loops to fix the few files screwing with my current script...
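The AttributeError in the question comes from find() returning None when a row does not match the path, followed by a .text access on that None. A small guard, sketched below, lets the loop record the gap and carry on instead of crashing on the few malformed files. This is only an illustration: it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs anywhere, the helper name find_text is hypothetical, and the markup is made up for the example.

```python
import xml.etree.ElementTree as ET

def find_text(element, path, default=None):
    """Return the text of the first node matching `path`, or `default` if none matches."""
    node = element.find(path)
    if node is None:
        return default
    return node.text

# Made-up rows: the second one has a stray tag instead of the expected links.
table = ET.fromstring(
    "<table>"
    "<tr><td><a/><a/><a><b>alice</b></a></td></tr>"
    "<tr><td><span>stray tag</span></td></tr>"
    "</table>"
)
usernames = [find_text(row, "td[1]/a[3]/b") for row in table.findall("tr")]
```

Rows that come back as None can then be logged with their file name and inspected by hand, which is how the extra tag was eventually spotted here.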

Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?

Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target)-1)
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row in Application.ActiveDocument.Tables(2).Rows
    Descr = row.Cells(2).Range.Text
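If the cell text is read from Python instead (for example through win32com, as the question considers), the same end-of-cell marker has to be stripped there too. The sketch below assumes the marker arrives as a carriage return followed by a bell character ("\r\x07"); that is an assumption worth checking against your own documents, and the helper name strip_cell_marker is hypothetical.

```python
# Assumed end-of-cell marker: carriage return + bell character.
CELL_END_MARKER = "\r\x07"

def strip_cell_marker(cell_text):
    """Remove a trailing end-of-cell marker from a table cell's text, if present."""
    if cell_text.endswith(CELL_END_MARKER):
        return cell_text[: -len(CELL_END_MARKER)]
    # Fall back to removing a lone bell character; leave other text unchanged.
    return cell_text.rstrip("\x07")
```

Checking for the marker before slicing avoids chopping a real character off text that came from somewhere other than a table cell.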

Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=2)  # 2 is wdFormatText (plain text)
This is untested, but I think something like that will just open the file and save it as plain text; you could then read the text into Python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it offhand. Documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.

You could use OpenOffice. It can open word files, and also can run python macros.

I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.

How about saving the file as XML, then using Python or something else to pull the data out of Word and into the database?

It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.
