How to update fields in MS Word with Python Docx

I am working on a Python program that needs to add caption texts in MS Word to Figures and Tables (with numbering). After adding the field however, the field does not appear in my Word-document until I update the field (it's just an empty space in my document, until I update the field, then it jumps to e.g. '2').
This is my code for adding the field:
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def add_caption_number(self, field_code):
    """ Add a caption number for the field

    :argument
        field_code: [string] the type of field e.g. 'Figure', 'Table'...
    """
    # Set the pointer to the last paragraph (e.g. the 'Figure ' caption text)
    run = self.last_paragraph.add_run()
    r = run._r
    # Add a Figure Number field xml element
    fldChar = OxmlElement("w:fldChar")
    fldChar.set(qn("w:fldCharType"), "begin")
    r.append(fldChar)
    instrText = OxmlElement("w:instrText")
    # Raw string so the backslash of the \* switch is kept literally
    instrText.text = r" SEQ %s \* ARABIC" % field_code
    r.append(instrText)
    fldChar = OxmlElement("w:fldChar")
    fldChar.set(qn("w:fldCharType"), "end")
    r.append(fldChar)
self.last_paragraph is the last paragraph that has been added and field_code is to select whether to add a Figure or a Table caption number.
I have found an example for updating the fields, but it makes Word show a confirmation popup when the document is opened:
import lxml.etree
from docxtpl import DocxTemplate

def update_fields(save_path):
    """ Automatically updates the fields when opening the word document """
    namespace = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    doc = DocxTemplate(save_path)
    element_updatefields = lxml.etree.SubElement(
        doc.settings.element, f"{namespace}updateFields"
    )
    element_updatefields.set(f"{namespace}val", "true")
    doc.save(save_path)
Is there a way to do this without the popup window and without adding macros to the Word document? This needs to work on MacOS and Windows btw.

The behavior described in the question is by design. Updating of fields is a potential security risk - there are some field types that can access external content. Therefore, dynamic content generated outside the Word UI needs user confirmation to update.
I know of only three ways to prevent displaying the prompt:
1. Calculate the values and insert the field result during document generation. The fields will still be updatable, in the normal manner, but won't require updating when the document is opened the first time. (Leave out the code in the second part of the question.)
2. Use Word Automation Services (requires on-premise SharePoint) to open the document, which will update the fields (as in the second part of the question).
3. Include a VBA project that performs the field update in an AutoOpen macro. This, of course, means the document type must be macro-enabled (docm) and that macros are allowed to execute on the target installation (also a security risk, of course).
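The first option (inserting the field result yourself) can be sketched as follows. This is a minimal illustration under some assumptions, not a definitive implementation: it uses only the standard library's xml.etree.ElementTree to build the run elements, SeqFieldBuilder and build_runs are hypothetical names, and the computed numbers are only correct if captions are generated in document order and the document contains no other SEQ fields with the same label. The point is the extra "separate" field character followed by a text run holding the number, which Word shows until the field is next updated.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W_NS)


def w(tag):
    """Return the fully qualified name of a WordprocessingML tag."""
    return "{%s}%s" % (W_NS, tag)


class SeqFieldBuilder:
    """Hypothetical helper: builds SEQ fields that carry a precomputed result."""

    def __init__(self):
        # One running number per label ('Figure', 'Table', ...)
        self._counters = defaultdict(int)

    def build_runs(self, field_code):
        """Return five <w:r> elements: begin, instruction, separate, result, end."""
        self._counters[field_code] += 1
        number = self._counters[field_code]

        begin = ET.Element(w("fldChar"), {w("fldCharType"): "begin"})
        instr = ET.Element(w("instrText"), {XML_SPACE: "preserve"})
        instr.text = " SEQ %s \\* ARABIC " % field_code
        separate = ET.Element(w("fldChar"), {w("fldCharType"): "separate"})
        result = ET.Element(w("t"))
        result.text = str(number)  # the stored field result
        end = ET.Element(w("fldChar"), {w("fldCharType"): "end"})

        runs = []
        for child in (begin, instr, separate, result, end):
            run = ET.Element(w("r"))
            run.append(child)
            runs.append(run)
        return runs
```

With python-docx, the same five pieces could presumably be created with OxmlElement and qn, as in the question's add_caption_number, and appended to the caption paragraph; the updateFields setting from update_fields is then left out.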

Related

Python-Docx changing existing text creates weird problems

So I am trying to automate some work tasks to help me create reports for various assignments. I have a template report that I want to simply replace the placeholder text. My code works for the most part, but the method I am using comes up with some strange results. Here are the relevant sections of my current code:
from docx import Document

def create_new_report(self):
    report = Document('Template.docx')
    # Change Headers First
    for sec in report.sections:
        head = sec.header
        for para in head.paragraphs:
            for run in para.runs:
                self.replace_run_text(run)
    # Then the Tables
    for table in report.tables:
        for row in table.rows:
            for cell in row.cells:
                for para in cell.paragraphs:
                    for run in para.runs:
                        self.replace_run_text(run)
    # And finally the Body
    for para in report.paragraphs:
        for run in para.runs:
            self.replace_run_text(run)

def replace_run_text(self, run):
    # Takes the run, performs string.replace for args, and writes the result back to the run
    text = run.text
    for arg in self.args:  # a list of keys and the text to replace them with
        text = text.replace(arg[0], arg[1])
    run.text = text
For the most part this works well. However, when running this, I have noticed that it has some weird consequences. For the header, I had to hard-code which specific paragraphs to work with because running this on the entire thing was deleting my company logo as an image.
In the body, this code will remove page breaks, or form text boxes. I break up everything to individual runs in order to retain all styling, and that seems to work well at least.
For now I have hard-coded around the idiosyncrasies that come up, but I want to be able to make changes to my template document and have it just work, rather than needing to change those hard-coded sections as well. Does anyone have any advice as to why this particular behavior is occurring?
It really doesn't make sense to me. Why is the page break or the logo being removed when they do not even contain any runs? Or at the very least, I can guarantee they do not contain any of the text keys that are being replaced. They shouldn't be being messed with at all. But they are. I would appreciate any insight that anyone has!
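One possible explanation, worth checking against the template: inline pictures and page breaks are stored inside runs in the underlying XML, and python-docx documents that assigning to run.text replaces any existing run content. The code above assigns run.text for every run, even when no key matched. A minimal sketch, assuming that is the cause, writes the text back only when a replacement actually changed it. Here replacements is a hypothetical list of (key, value) pairs standing in for self.args.

```python
def replace_run_text(run, replacements):
    """Apply (key, value) replacements to run.text; return True if it changed."""
    original = run.text
    updated = original
    for key, value in replacements:
        updated = updated.replace(key, value)
    if updated != original:
        # Only overwrite runs whose text really changed; all other runs are
        # left alone, so their non-text children are never touched.
        run.text = updated
        return True
    return False
```

A run that holds both a placeholder key and a picture or break would still lose the non-text content, so placeholders are best kept in runs of their own.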

Use pywin32 to go to specified page in word doc

I have a long word document with over 100 tables. I am trying to allow users to select a page number via python to enter data into the table on the specified page within the word document. I am able to enter data into a table with the following code, but the problem is that the document is so long, it's not easy for a user to know which table number they are on when they are 80 pages into the word document (not every page has a table and some pages have multiple tables).
import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Documents.Open(my_document_path)
doc = word.ActiveDocument
table = doc.Tables(51) #random selection for testing purposes
table.Cell(Row = 7, Column = 2).Range.Text = "test"
So what I need help with is extracting the table number on a page in a word document that is specified via user input (i.e., user specifies that they want to add data to page 13 so the code will determine that table 51 is on page 72).
If I record a macro in word for simply jumping to a page, this is the VB code...
Selection.GoTo What:=wdGoToPage, Which:=wdGoToNext, Name:="13"
I have tried translating this into Python using the following line of code, but it's not jumping to the correct page.
doc.GoTo(win32.constants.wdGoToPage, win32.constants.wdGoToNext, "13")
GoTo works with the Selection object, which is a property of the Word application, not a document. In the code in the question, word represents the Word application, so word.Selection.GoTo should work.
Note the substitution of wdGoToAbsolute for wdGoToNext in the GoTo method call; that's "safer" for going to a specific page number.
In order to get the entire Range for a page it's possible to use a built-in bookmark name "\Page". This only works for the page where the selection is, which is why it's necessary to first go to the page. It's then possible to get the first table (or any other table index) on the page.
If the index number of the table in the document is also required, that can be calculated by getting the document's range, then setting the end-point to the end of the page's range.
import win32com.client as win32
# EnsureDispatch generates the type library wrappers that populate win32.constants
word = win32.gencache.EnsureDispatch("Word.Application")
word.Documents.Open(my_document_path)
doc = word.ActiveDocument
word.Selection.GoTo(win32.constants.wdGoToPage, win32.constants.wdGoToAbsolute, "13")
# Raw string so the backslash in the built-in bookmark name is kept literally
rngPage = doc.Bookmarks(r"\Page").Range
table = rngPage.Tables(1) #first table on the page
table.Cell(Row = 7, Column = 2).Range.Text = "test"
#rngToPage = doc.Content
#rngToPage.End = rngPage.End
#tableIndex = rngToPage.Tables.Count
Note that I don't work with Python, so I'm not able to test the Python code. So watch out for syntax errors. For this reason, I've appended the VBA code I used to test the approach.
Sub GetTableCountOnPage()
    Dim tbl As Word.Table
    Dim sPage As String
    Dim rngPage As Word.Range
    sPage = InputBox("On which page is the table?")
    Selection.GoTo What:=wdGoToPage, Name:=sPage
    Set rngPage = Selection.Document.Bookmarks("\Page").Range
    If rngPage.Tables.Count > 0 Then
        Set tbl = rngPage.Tables(1)
        tbl.Select
        Dim rngToTable As Word.Range
        Set rngToTable = Selection.Document.content
        rngToTable.End = rngPage.End
        Debug.Print rngToTable.Tables.Count & " to this point."
    End If
End Sub

Is there a way to programmatically reject changes to a word document using python, while not deleting comments from it?

I have old version of a few word documents (word document with '.doc' extension) all of which have a lot of tracked changes in them. Most of the changes have comments associated with them.
I need to figure out a way to use python to reject all the changes that have been made in the documents, while retaining the comments.
I tried this with the new versions of word document('.docx' files) and faced no issues. All the changes were rejected and the word document still had all the comments in it. But when I tried to do it with the older versions of word document, all my comments got deleted.
I was using the following function at first with few different versions of the word file.
def reject_changes(path):
    doc = word.Documents.Open(path)
    doc.Activate()
    word.ActiveDocument.TrackRevisions = False
    word.ActiveDocument.Revisions.RejectAll()
    word.ActiveDocument.Save()
    doc.Close(False)
1. I tried to use the above function with the original word document.
2. I changed the extension of the file to '.docx' and tried the above function.
3. I made a copy of the document and saved it in '.docx' format.
In all these cases the comments were deleted.
I then tried the following code:
def reject_changes(path):
    doc = word.Documents.Open(path)
    doc.Activate()
    word.ActiveDocument.TrackRevisions = False
    nextRev = word.Selection.NextRevision()
    while nextRev:
        nextRev.Reject()
        nextRev = word.Selection.NextRevision()
    word.ActiveDocument.Save()
    doc.Close(False)
For some reason this code was almost working. But on checking a few of the documents again, I found that while most of the comments remained, a couple of them were still deleted.
I think that since the comments are being deleted, they are probably treated as part of Revisions. In that case, is it possible to check whether a revision is a comment or not? If not, can someone please suggest a way to ensure that no comments are deleted when rejecting the changes?
Edit:
So, I found out that the comments that were getting deleted had been added to the document while the 'Track Changes' option was active. I guess that made the comments part of the revision. So my first function works pretty well when the comments were made while the 'Track Changes' option was not active.
But then, I have more than twenty word documents (all of them a mix of doc and docx files), each of which has at least fifteen pages and over fifty comments.
I am using win32com.client. I am not too familiar with other packages that work with MS word. Any help would be appreciated.
Thanks!
Okay, so I was able to get a workaround for this by:
1. Creating a selection object and selecting the scope of the text marked by the comment.
2. Saving the range of the commented text into a range object.
3. Rejecting the tracked changes for the selected text.
4. Getting the new text based on the range object that was created in step 2.
This method takes a lot of time, though. The easiest way to extract the marked text is to ensure that comments are made while Word is not tracking changes.
This is the code I am using now.
import os
import win32com.client as win32

def reject_changes(path, doc_names):
    word = win32.gencache.EnsureDispatch('Word.Application')
    rejected_changes = []
    for doc in doc_names:
        # open the word document
        # (the original passed an undefined name here; joining path and doc is an assumption)
        wb = word.Documents.Open(os.path.join(path, doc))
        wb.Activate()
        current_doc = word.ActiveDocument
        current_doc.TrackRevisions = False
        text = ''
        # iterating over the comments
        for c in current_doc.Comments:
            sentence_range = c.Scope  # returns a range object of the text marked by comment
            sentence_range.Select()  # select the sentence marked by sentence_range
            nextRev = word.Selection.NextRevision()  # checks for the next revision in word
            while nextRev:
                # if the next revision is not within the sentence_range then stop.
                if nextRev.Range.Start < sentence_range.Start or nextRev.Range.End > sentence_range.End:
                    break
                else:
                    nextRev.Reject()
                    new_range = current_doc.Range(sentence_range.Start, sentence_range.End)
                    text = new_range.Text
                nextRev = word.Selection.NextRevision()
            author = c.Author
            rejected_changes.append((doc, author, text, path))
        current_doc.Save()
        wb.Close(False)
    return rejected_changes

Haystack + Xapian: Can't get autocomplete functionality working

I'm trying to get autocomplete working on my server for search. Here is an example of one of my indexer classes:
class ArtistIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    artist_name = indexes.CharField(model_attr='clean_artist_name', null=True)
    submitted_date = indexes.DateTimeField(model_attr='submitted_date')
    total_count = indexes.IntegerField(model_attr='total_count')
    # This is used for autocomplete
    content_auto = indexes.NgramField(use_template=True)

    def get_model(self):
        return Artist

    def index_queryset(self, using=None):
        """ Used when the entire index of a model is updated. """
        return self.get_model().objects.filter(date_submitted__lte=datetime.now())

    def get_updated_field(self):
        return "last_data_change"
The text and content_auto fields are populated using templates which, in the case of Artists, contain just the artist name. According to the docs, something like this should work for autocomplete:
objResultSet = SearchQuerySet().models(Artist).autocomplete(content_auto=search_term)
However, trying this with the string "bill w" returns Bill Stephney as the top result and then Bill Withers as the second result. This is because Bill Stephney has more records in the database, but Stephney shouldn't be matching this query: once the "w" is detected it should only match Bill Withers (and other Bill Ws). I've also tried wildcards:
objResultSet = SearchQuerySet().models(Artist).filter(content_auto=search_term + '*')
and
objResultSet = SearchQuerySet().models(Artist).filter(text=AutoQuery(search_term + '*'))
but the wildcard seems to cause a load of problems, with the development server hanging and eventually stopping due to a Write Failed: Broken Pipe error with a cryptic stack trace, all of which is within the Python framework. Has anyone managed to get this working properly? Is NgramField the right type to use? I've tried using EdgeNgramField but that gave me similar results.
I believe the Haystack documentation recommends EdgeNgramField for "standard text," which I assume is English. They recommend NgramField for Asian languages or if you want to match across word boundaries. I.e., I think you want your content_auto to use EdgeNgramField:
content_auto = indexes.EdgeNgramField(use_template=True)
Also, since n-grams are not exactly wildcard searches (in the way we use * [the asterisk] in shell script glob matches, for example), you should not use * in your filter.
One thing I have found that makes a difference in the search results is the set of parameters you can tweak in the backend engine: there are settings for the n-gram tokenizer and n-gram filter. Depending on the search engine backend you're using, changing the min_gram values will affect the results you get in your matches.
I've only used the elasticsearch backend so I don't know if other backends are as sensitive to these n-gram settings as the solr/elasticsearch ones. Basically, I created a custom backend based on the default one that comes with haystack and tweaked the min_gram values to test the matches. The higher the value you set, the more "accurate" the match is, since it has to match a longer token.
See this question on using a backend with custom n-gram settings for elasticsearch:
EdgeNgramField min and max letters in django haystack
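To make the effect of min_gram concrete, here is a simplified pure-Python illustration of edge n-grams. It is only a sketch and not the actual tokenizer used by Haystack, Xapian or Elasticsearch; the function names are hypothetical and max_gram is assumed here as the upper-bound counterpart of min_gram.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the prefixes of token whose length lies in [min_gram, max_gram]."""
    token = token.lower()
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(text, min_gram, max_gram):
    """Collect the edge n-grams of every whitespace-separated word in text."""
    terms = set()
    for word in text.split():
        terms.update(edge_ngrams(word, min_gram, max_gram))
    return terms


withers = index_terms("Bill Withers", min_gram=1, max_gram=15)
stephney = index_terms("Bill Stephney", min_gram=1, max_gram=15)
# With min_gram=1 the one-letter term "w" is indexed for Withers only.
print("w" in withers, "w" in stephney)
```

In this toy model, raising min_gram to 2 means a one-letter query term such as "w" no longer matches anything, which is the trade-off behind the "accuracy" remark above: longer minimum grams give stricter matches but ignore very short input.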

Django: How to modify a text field before showing it in admin

I have a Django Model with a text field. I would like to modify the content of the text field before it's presented to the user in Django Admin.
I was expecting to see signal equivalent of post_load but it doesn't seem to exist.
To be more specific:
I have a text field that takes user input. In this text field there is a read more separator. Text before the separator is going to go into introtext field, everything after goes into fulltext field.
At the same time, I only want to show the user 1 text field when they're editing the article.
My plan was to on_load read the data from introtext and fulltext field and combine them into fulltext textarea. On pre_save, I would split the text using the read more separator and store intro in introtext and remainder in fulltext.
So, before the form is displayed, I need to populate the fulltext field with
introtext + '<!--readmore-->' + fulltext
and I need to be able to do this for existing items.
Have a look into Providing your own form for the admin pages.
Once you have your own form, you can use the initial parameter of the form field to provide the value you want. See the docs on the initial parameter for form fields: it accepts either a constant or a callable.
There is no post_load because there is no load function.
An instance is loaded in the model's __init__ method, so the right answer is to use the post_init signal.
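Whichever hook ends up calling them (a custom admin form or a post_init / pre_save pair), the combine and split steps described in the question can be sketched as plain functions. This is a minimal sketch: the function names are hypothetical, and treating text without a separator as fulltext only is an assumption.

```python
READMORE = "<!--readmore-->"


def combine_texts(introtext, fulltext):
    """Build the single value shown in the admin textarea."""
    return introtext + READMORE + fulltext


def split_texts(combined):
    """Split the edited value back into (introtext, fulltext)."""
    intro, separator, rest = combined.partition(READMORE)
    if not separator:
        # No separator present: keep everything as fulltext (an assumption).
        return "", combined
    return intro, rest
```

Only the first occurrence of the separator is used for the split, so any later occurrence stays inside fulltext.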
