Use pywin32 to go to specified page in word doc - python

I have a long word document with over 100 tables. I am trying to allow users to select a page number via python to enter data into the table on the specified page within the word document. I am able to enter data into a table with the following code, but the problem is that the document is so long, it's not easy for a user to know which table number they are on when they are 80 pages into the word document (not every page has a table and some pages have multiple tables).
import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Documents.Open(my_document_path)
doc = word.ActiveDocument
table = doc.Tables(51) #random selection for testing purposes
table.Cell(Row = 7, Column = 2).Range.Text = "test"
So what I need help with is finding the table number for a page that the user specifies (e.g., the user says they want to add data on page 72, and the code determines that table 51 is on that page).
If I record a macro in word for simply jumping to a page, this is the VB code...
Selection.GoTo What:=wdGoToPage, Which:=wdGoToNext, Name:="13"
I have tried translating this into Python using the following line of code, but it's not jumping to the correct page.
doc.GoTo(win32.constants.wdGoToPage, win32.constants.wdGoToNext, "13")

GoTo works with the Selection object, which is a property of the Word application, not a document. In the code in the question, word represents the Word application, so word.Selection.GoTo should work.
Note the substitution of wdGoToAbsolute for wdGoToNext in the GoTo method call - that's "safer" for going to a specific page number.
In order to get the entire Range for a page it's possible to use a built-in bookmark name "\Page". This only works for the page where the selection is, which is why it's necessary to first go to the page. It's then possible to get the first table (or any other table index) on the page.
If the index number of the table in the document is also required, that can be calculated by getting the document's range, then setting the end-point to the end of the page's range.
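import win32com.client as win32
# EnsureDispatch builds the type library wrappers so that win32.constants is populated
word = win32.gencache.EnsureDispatch("Word.Application")
word.Documents.Open(my_document_path)
doc = word.ActiveDocument
# GoTo belongs to the Selection (application level), not to the document
word.Selection.GoTo(win32.constants.wdGoToPage, win32.constants.wdGoToAbsolute, "13")
rngPage = doc.Bookmarks(r"\Page").Range #raw string keeps the backslash literal
table = rngPage.Tables(1) #first table on the page
table.Cell(Row = 7, Column = 2).Range.Text = "test"
#Optional: number of tables from the start of the document to the end of this page
#rngToPage = doc.Content
#rngToPage.End = rngPage.End
#tableIndex = rngToPage.Tables.Count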
Note that I don't work with Python, so I'm not able to test the Python code. So watch out for syntax errors. For this reason, I've appended the VBA code I used to test the approach.
Sub GetTableCountOnPage()
    Dim tbl As Word.Table
    Dim sPage As String
    Dim rngPage As Word.Range
    sPage = InputBox("On which page is the table?")
    Selection.GoTo What:=wdGoToPage, Name:=sPage
    Set rngPage = Selection.Document.Bookmarks("\Page").Range
    If rngPage.Tables.Count > 0 Then
        Set tbl = rngPage.Tables(1)
        tbl.Select
        Dim rngToTable As Word.Range
        Set rngToTable = Selection.Document.content
        rngToTable.End = rngPage.End
        Debug.Print rngToTable.Tables.Count & " to this point."
    End If
End Sub
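Since the Python version above is untested, here is a minimal sketch of the same approach as a helper function. It rests on a few assumptions: table_index_on_page is a hypothetical name, the two numeric constants are the documented values of wdGoToPage and wdGoToAbsolute (hard-coded so the function does not depend on win32.constants), and a table that spans a page break counts as being on every page it touches.

```python
WD_GO_TO_PAGE = 1      # value of Word's wdGoToPage
WD_GO_TO_ABSOLUTE = 1  # value of Word's wdGoToAbsolute


def table_index_on_page(word, doc, page_number, nth=1):
    """Return the document-wide index of the nth table on a page, or None.

    word is the Word Application object and doc the open Document.
    """
    # Move the selection to the page so that the "\Page" bookmark refers to it.
    word.Selection.GoTo(WD_GO_TO_PAGE, WD_GO_TO_ABSOLUTE, page_number)
    page_range = doc.Bookmarks(r"\Page").Range
    tables_on_page = page_range.Tables.Count
    if nth < 1 or nth > tables_on_page:
        return None
    # Count every table from the start of the document to the end of the page.
    up_to_page = doc.Content
    up_to_page.End = page_range.End
    tables_up_to_page = up_to_page.Tables.Count
    # The last table on the page has index tables_up_to_page; step back from there.
    return tables_up_to_page - tables_on_page + nth
```

With a running Word instance this would be called as table_index_on_page(word, doc, 13), and a non-None result could then be passed to doc.Tables(...).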

Related

How to extract text (PyPDF2) from specific location/span on PDF

I have already extracted the text of a PDF page into the Text variable.
I'm looking to extract the number that comes after the string 'your number is' (the 14-character string was matched at span (982, 996)):
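import re
import PyPDF2
reader = PyPDF2.PdfFileReader(filename)
PageObj = reader.getPage(0)  # fetch the page before extracting its text
Text = PageObj.extractText()
String = 'your number is'  # the search string described above
ResSearch = re.search(String, Text)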
I'm getting a result: span = (982, 996) match = 'your number is'. Now all I need is to scrape the three digit text that comes after that ('your number is 105'), as the files are changing daily and the fetching should be dynamic.
Thank you everyone !!
The problem is about the regex, not the PDF itself. Assuming there is at most one match per page, you can use search; otherwise use findall. Have a look at the re documentation on how to use groups, in the section on (...).
import PyPDF2, re
filename = '' # path to the PDF file
pdf_r = PyPDF2.PdfFileReader(open(filename, 'rb'))
text = pdf_r.getPage(0).extractText() # from 1st page or make a loop
if p := re.search(r'your number is (\d{3})', text):
    my_number = int(p.groups()[0]) # as int
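To make the difference between search and findall concrete, here is a small sketch that runs on a made-up string instead of a real PDF page. The sample text and the variable names are assumptions for illustration only.

```python
import re

# Made-up page text; a real page would come from extractText().
text = "Dear customer, your number is 105. Old letter: your number is 207."

# The parentheses form a capturing group around the three digits.
pattern = re.compile(r"your number is (\d{3})")

# search() stops at the first match and returns a match object (or None).
first = pattern.search(text)
first_number = int(first.group(1)) if first else None

# findall() returns the captured group of every match as a list of strings.
all_numbers = [int(n) for n in pattern.findall(text)]
```

Here first_number holds the first three-digit number after the phrase, and all_numbers holds every such number on the page.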
Alternatively, use PyPDF4: the syntax is the same and it doesn't "have" this extractText issue.
From the documentation of extractText:
This works well for some PDF files, but poorly for others, depending on the generator used. [...] Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Adding Image at the very beginning of an already existing docx document [duplicate]

I use python-docx to generate Microsoft Word documents. The user wants to be able to write, for example, "Good Morning every body,This is my %(profile_img)s do you like it?"
in an HTML field. I then create a Word document, retrieve the user's picture from the database, and replace the keyword %(profile_img)s with that picture at its position in the text, NOT at the END OF THE DOCUMENT. With python-docx we use this instruction to add a picture:
document.add_picture('profile_img.png', width=Inches(1.25))
The picture is added to the document, but the problem is that it is added at the end of the document.
Is it impossible to add a picture at a specific position in a Microsoft Word document with Python? I've not found any answers to this on the net, but have seen people asking the same elsewhere with no solution.
Thanks (note: I'm not a hugely experienced programmer, and other than this awkward part the rest of my code will be very basic)
Quoting the python-docx documentation:
The Document.add_picture() method adds a specified picture to the end of the document in a paragraph of its own. However, by digging a little deeper into the API you can place text on either side of the picture in its paragraph, or both.
When we "dig a little deeper", we discover the Run.add_picture() API.
Here is an example of its use:
from docx import Document
from docx.shared import Inches
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
r.add_picture('/tmp/foo.jpg')
r.add_text(' do you like it?')
document.save('demo.docx')
Well, I don't know if this will apply to you, but here is what I've done to place an image at a specific spot in a docx document:
I created a base docx document (a template document). In this file, I inserted some tables without borders, to be used as placeholders for images. When creating the document, I first open the template and then update the file by creating the images inside the tables. So the code itself is not much different from your original code; the only difference is that I'm creating the paragraph and image inside a specific table.
from docx import Document
from docx.shared import Inches
doc = Document('addImage.docx')
tables = doc.tables
p = tables[0].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('resized.png',width=Inches(4.0), height=Inches(.7))
p = tables[1].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('teste.png',width=Inches(4.0), height=Inches(.7))
doc.save('addImage.docx')
Here's my solution. Its advantage over the first answer is that it surrounds the picture with a title (with style Heading 1) and a section for additional comments. Note that you have to do the insertions in the reverse of the order in which they appear in the Word document.
This snippet is particularly useful if you want to programmatically insert pictures in an existing document.
from docx import Document
from docx.shared import Inches
# ------- initial code -------
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
picPath = 'D:/Development/Python/aa.png'
r.add_picture(picPath)
r.add_text(' do you like it?')
document.save('demo.docx')
# ------- improved code -------
document = Document()
p = document.add_paragraph('Picture bullet section', 'List Bullet')
p = p.insert_paragraph_before('')
r = p.add_run()
r.add_picture(picPath)
p = p.insert_paragraph_before('My picture title', 'Heading 1')
document.save('demo_better.docx')
This adapts the answer written by Robᵩ while allowing for more flexible input from the user.
My assumption is that the HTML field mentioned by Kais Dkhili (the original enquirer) is already loaded in docx.Document(). So...
First, identify where the related HTML text is in the document.
import re  # regex module
img_tag = re.compile(r'%\(profile_img\)s') # declare pattern
img_paragraph = None
for _p in document.paragraphs:
    # search() finds the placeholder anywhere in the paragraph text
    if img_tag.search(_p.text):
        img_paragraph = _p
        # for a full document search, make img_paragraph a list,
        # use its append method instead, and drop the break
        break
Then replace the placeholder identified by img_tag = '%(profile_img)s' with the desired image.
The following code assumes that the paragraph text is contained in a single run.
It may need to be changed accordingly otherwise.
temp_text = img_tag.split(img_paragraph.text)
img_paragraph.runs[0].text = temp_text[0]
_r = img_paragraph.add_run()
_r.add_picture('profile_img.png', width = Inches(1.25))
img_paragraph.add_run(temp_text[1])
and done. document.save() it if finalised.
In case you are wondering what to expect from the temp_text...
[In]
img_tag.split(img_paragraph.text)
[Out]
['This is my ', ' do you like it?']
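The single-run assumption above often does not hold, because Word may split the placeholder across several runs. Below is a minimal sketch of one way to handle that case. The function name is hypothetical, it relies only on paragraph.text, paragraph.runs, paragraph.add_run() and run.add_picture() from python-docx, and it deliberately trades away the formatting of the later runs by emptying them.

```python
PLACEHOLDER = "%(profile_img)s"


def replace_placeholder_with_picture(paragraph, image_path, width=None):
    """Replace the placeholder in a python-docx paragraph with a picture.

    Returns True if the placeholder was found, False otherwise.
    """
    before, found, after = paragraph.text.partition(PLACEHOLDER)
    runs = paragraph.runs
    if not found or not runs:
        return False
    # Keep the text before the placeholder in the first run and
    # empty every other run (their individual formatting is lost).
    runs[0].text = before
    for run in runs[1:]:
        run.text = ""
    # Append the picture, then the text that followed the placeholder.
    paragraph.add_run().add_picture(image_path, width=width)
    paragraph.add_run(after)
    return True
```

It could be called for each paragraph in document.paragraphs, for example with width=Inches(1.25), before saving the document.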
I spent a few hours on this. If you need to add images to a template doc file using Python, the best solution is to use the python-docx-template library.
Documentation is available here
Examples available in here
This is a variation on a theme. Letting I be the index of the paragraph in the specific document, then:
p = doc.paragraphs[I].insert_paragraph_before('\n')
p.add_run().add_picture('Fig01.png', width=Cm(15))

Problem with python-docx putting pictures in a table

I am using python-docx to create a new document and then I add a table (rows=1,cols=5). Then I add a picture to each of the five cells. I have the code working but what I see from docx is not what I see when I use Word manually.
Specifically, if I set on "Show Formatting Marks" and then look at what was generated by docx, there is always a hard return in the beginning of each of the cells (put there by the add_paragraph method.) When I use Word manually, there is no hard return.
The result of the hard return is that each picture is down one line from where I want it to be. If I use Word, the pictures are where I expect them to be.
What is also strange is that on the docx document I can manually go in and single click next to the hard return, press the down cursor key once, and then press the Backspace key once and the hard return is deleted and the picture moves to the top of the cell.
So my question is, does anyone know of a way to get a picture in a table cell without having a hard return put in when the add_paragraph method is executed?
Any help would be greatly appreciated.
from docx import Document
from docx.enum.table import WD_ALIGN_VERTICAL
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Inches, Pt

def paragraph_format_run(cell):
    paragraph = cell.add_paragraph()
    format = paragraph.paragraph_format
    run = paragraph.add_run()
    format.space_before = Pt(0)
    format.space_after = Pt(0)
    format.line_spacing = 1.0
    format.alignment = WD_ALIGN_PARAGRAPH.CENTER
    return paragraph, format, run

def main():
    document = Document()
    sections = document.sections
    section = sections[0]
    section.top_margin = Inches(1.0)
    section.bottom_margin = Inches(1.0)
    section.left_margin = Inches(0.75)
    section.right_margin = Inches(0.75)
    table = document.add_table(rows=1, cols=5)
    table.allow_autofit = False
    cells = table.rows[0].cells
    for i in range(5):
        # raw f-string so the backslash is not treated as an escape
        pic_path = rf"Table_Images\pic_{i}.jpg"
        cell = cells[i]
        cell.vertical_alignment = WD_ALIGN_VERTICAL.TOP
        cell_p, cell_f, cell_r = paragraph_format_run(cell)
        cell_r.add_picture(pic_path, width=Inches(1.25))
    doc_path = "TableTest_1.docx"
    document.save(doc_path)
Each blank cell in a newly created table contains a single empty paragraph. This is just one of those things about the Word format. I suppose it gives a place to put the insertion mark (flashing vertical cursor) when you're using the Word application. A completely empty cell would have no place to "click" into.
This requires that any code that adds content to a cell must treat the first paragraph differently. In short, you access the first paragraph as cell.paragraphs[0] and only create second and later paragraphs with cell.add_paragraph().
So in this particular case, the paragraph_format_run() function would change like this:
def paragraph_format_run(cell):
    paragraph = cell.paragraphs[0]
    ...
This assumes a lot, like it only works when cell is empty, but given what you now know about cell paragraphs you may be able to adapt it to adding multiple images into a cell if you later decide you need that.
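As a sketch of that adaptation, the helper below reuses the cell's built-in first paragraph while it is still empty and only adds a new paragraph for later pictures. The function name is hypothetical, and "empty" is taken to mean that the first paragraph has no runs yet, which is an assumption that may not fit every document.

```python
def add_picture_to_cell(cell, image_path, width=None):
    """Add a picture to a python-docx table cell without a leading blank line.

    Returns the paragraph that now holds the picture.
    """
    first = cell.paragraphs[0]
    # A new cell already contains one empty paragraph; use it for the first
    # picture and create further paragraphs only once it has content.
    paragraph = first if not first.runs else cell.add_paragraph()
    run = paragraph.add_run()
    run.add_picture(image_path, width=width)
    return paragraph
```

Calling it twice on the same cell puts the first picture at the top of the cell and the second one in a new paragraph below it.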

How to update fields in MS Word with Python Docx

I am working on a Python program that needs to add caption texts in MS Word to Figures and Tables (with numbering). After adding the field however, the field does not appear in my Word-document until I update the field (it's just an empty space in my document, until I update the field, then it jumps to e.g. '2').
This is my code for adding the field:
def add_caption_number(self, field_code):
    """ Add a caption number for the field
    :argument
        field_code: [string] the type of field e.g. 'Figure', 'Table'...
    """
    # Set the pointer to the last paragraph (e.g. the 'Figure ' caption text)
    run = self.last_paragraph.add_run()
    r = run._r
    # Add a Figure Number field xml element
    fldChar = OxmlElement("w:fldChar")
    fldChar.set(qn("w:fldCharType"), "begin")
    r.append(fldChar)
    instrText = OxmlElement("w:instrText")
    # raw string so that \* reaches Word unchanged
    instrText.text = r" SEQ %s \* ARABIC" % field_code
    r.append(instrText)
    fldChar = OxmlElement("w:fldChar")
    fldChar.set(qn("w:fldCharType"), "end")
    r.append(fldChar)
self.last_paragraph is the last paragraph that has been added and field_code is to select whether to add a Figure or a Table caption number.
I have found an example for updating the fields, but with it Word shows a popup window asking for confirmation when the document is opened:
import lxml.etree
from docxtpl import DocxTemplate

def update_fields(save_path):
    """ Automatically updates the fields when opening the word document """
    namespace = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    doc = DocxTemplate(save_path)
    element_updatefields = lxml.etree.SubElement(
        doc.settings.element, f"{namespace}updateFields"
    )
    element_updatefields.set(f"{namespace}val", "true")
    doc.save(save_path)
Is there a way to do this without the popup window and without adding macros to the Word document? This needs to work on MacOS and Windows btw.
The behavior described in the question is by design. Updating of fields is a potential security risk - there are some field types that can access external content. Therefore, dynamic content generated outside the Word UI needs user confirmation to update.
I know of only three ways to prevent displaying the prompt:
1. Calculate the values and insert the field result during document generation. The fields will still be updatable, in the normal manner, but won't require updating when the document is opened the first time. (Leave out the code in the second part of the question.)
2. Use Word Automation Services (requires on-premise SharePoint) to open the document, which will update the fields (as in the second part of the question).
3. Include a VBA project that performs the field update in an AutoOpen macro. This, of course, means the document type must be macro-enabled (docm) and that macros are allowed to execute on the target installation (also a security risk, of course).
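The first option can be sketched as follows: keep a counter per caption label and write the computed number into the field as its cached result, between a "separate" and an "end" field character. This is only an illustration under stated assumptions: seq_field_runs and the counters dictionary are hypothetical names, the function returns WordprocessingML strings rather than touching a document, and the numbers only match what Word would compute if every caption in the document is generated this way and in document order.

```python
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def seq_field_runs(field_code, counters):
    """Return the XML of five runs forming a SEQ field with a cached result."""
    # Compute the next number for this label, e.g. 'Figure' or 'Table'.
    counters[field_code] = counters.get(field_code, 0) + 1
    number = counters[field_code]
    instr = escape(f" SEQ {field_code} \\* ARABIC ")

    def run(inner):
        return f'<w:r xmlns:w="{W_NS}">{inner}</w:r>'

    return [
        run('<w:fldChar w:fldCharType="begin"/>'),
        run(f'<w:instrText xml:space="preserve">{instr}</w:instrText>'),
        run('<w:fldChar w:fldCharType="separate"/>'),
        run(f"<w:t>{number}</w:t>"),  # the value shown before any update
        run('<w:fldChar w:fldCharType="end"/>'),
    ]


counters = {}
first_figure = seq_field_runs("Figure", counters)
```

In a python-docx workflow, each string could presumably be turned into an element with docx.oxml.parse_xml and appended to the caption paragraph's underlying element, instead of building the elements with OxmlElement as in the question.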

How to extract image from table in MS Word document with docx library?

I am working on a program that needs to extract two images from a MS Word document to use them in another document. I know where the images are located (first table in the document), but when I try to extract any information from the table (even just plain text), I get empty cells.
Here is the Word document that I want to extract the images from. I want to extract the 'Rentel' images from the first page (first table, row 0 and 1, column 2).
I have tried the following code:
from docxtpl import DocxTemplate
source_document = DocxTemplate("Source document.docx")
# It doesn't really matter which rows or columns I use for the cells, everything is empty
print(source_document.tables[0].cell(0,0).text)
Which just gives me empty lines...
I have read in this discussion and this one that the problem might be that the content is "contained in a wrapper element that Python Docx cannot read". They suggest altering the source document, but I want to be able to select, as a source document, any document that was previously created with the same template (so those documents have the same problem and I cannot change every document separately). So a Python-only solution is really the only way I can think of to solve the problem.
Since I also only want those two specific images, extracting any random image from the xml by unzipping the Word file doesn't really suit my solution, unless I know which image name I need to extract from the unzipped Word file folders.
I really want this to work as it is part of my thesis (and I'm just an electromechanical engineer, so I don't know that much about software).
[EDIT]: Here is the xml code for the first image (source_document.tables[0].cell(0,2)._tc.xml) and here it is for the second image (source_document.tables[0].cell(1,2)._tc.xml). I noticed however that taking (0,2) as row and column value, gives me all the rows in column 2 within the first "visible" table. Cell (1,2) gives me all the rows in column 2 within the second "visible" table.
If the problem isn't directly solvable with Python Docx, is it a possibility to search for the image name or ID or something within the XML code and then add the image using this ID/name with Python Docx?
Well, the first thing that jumps out is that both of the cells (w:tc elements) you posted each contain a nested table. This is perhaps unusual, but certainly a valid composition. Maybe they did that so they could include a caption in a cell below the image or something.
To access the nested table you'd have to do something like:
outer_cell = source_document.tables[0].cell(0,2)
nested_table = outer_cell.tables[0]
inner_cell_1 = nested_table.cell(0, 0)
print(inner_cell_1.text)
# ---etc....---
I'm not sure that solves your whole problem, but it strikes me that this is two or more questions in the end, the first being: "Why isn't my table cell showing up?" and the second perhaps being "How do I get an image out of a table cell?" (once you've actually found the cell in question).
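When it is not known in advance which cells hold nested tables, a small recursive walk can help locate them. This is a minimal sketch: iter_cells is a hypothetical helper that relies only on table.rows, row.cells and cell.tables from python-docx, and merged cells may be reported more than once because python-docx repeats them in row.cells.

```python
def iter_cells(table, depth=0):
    """Yield (depth, cell) for every cell of a table, including nested tables."""
    for row in table.rows:
        for cell in row.cells:
            yield depth, cell
            # A cell can itself contain tables; descend into each of them.
            for nested_table in cell.tables:
                yield from iter_cells(nested_table, depth + 1)
```

For example, printing depth and cell.text for each item of iter_cells(source_document.tables[0]) shows at which nesting level the text actually lives.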
For the people who have the same problem, this is the code that helped me solve it:
First I extract the nested cell from the table using the following method:
@staticmethod
def get_nested_cell(table, outer_row, outer_column, inner_row, inner_column):
    """
    Returns the nested cell (table inside a table) of the *document*
    :argument
        table: [docx.Table] outer table from which to get the nested table
        outer_row: [int] row of the outer table in which the nested table is
        outer_column: [int] column of the outer table in which the nested table is
        inner_row: [int] row in the nested table from which to get the nested cell
        inner_column: [int] column in the nested table from which to get the nested cell
    :return
        inner_cell: [docx.Cell] nested cell
    """
    # Get the global first cell
    outer_cell = table.cell(outer_row, outer_column)
    nested_table = outer_cell.tables[0]
    inner_cell = nested_table.cell(inner_row, inner_column)
    return inner_cell
Using this cell, I can get the xml code and extract the image from that xml code. Note:
I didn't set the image width and height because I wanted it to be the same
In the replace_logos_from_source method I know that the table where I want to get the logos from is 'tables[0]' and that the nested table is in outer_row and outer_column '0', so I just filled it in the get_nested_cell method without adding extra arguments to replace_logos_from_source
def replace_logos_from_source(self, source_document, target_document, inner_row, inner_column):
    """
    Replace the employer and client logo from the *source_document* to the *target_document*. Since the table
    in which the logos are placed are nested tables, the source and target cells with *inner_row* and
    *inner_column* are first extracted from the nested table.
    :argument
        source_document: [DocxTemplate] document from which to extract the image
        target_document: [DocxTemplate] document to which to add the extracted image
        inner_row: [int] row in the nested table from which to get the image
        inner_column: [int] column in the nested table from which to get the image
    :return
        Nothing
    """
    # Get the target and source cell (I know that the table where I want to get the logos from is
    # 'tables[0]' and that the nested table is in outer_row and outer_column '0', so I just filled
    # it in without adding extra arguments to the method)
    target_cell = self.get_nested_cell(target_document.tables[0], 0, 0, inner_row, inner_column)
    source_cell = self.get_nested_cell(source_document.tables[0], 0, 0, inner_row, inner_column)
    # Get the xml code of the inner cell
    inner_cell_xml = source_cell._tc.xml
    # Get the image from the xml code
    image_stream = self.get_image_from_xml(source_document, inner_cell_xml)
    # Add the image to the target cell
    paragraph = target_cell.paragraphs[0]
    if image_stream: # If not None (image exists)
        run = paragraph.add_run()
        run.add_picture(image_stream)
    else:
        # Set the target cell text equal to the source cell text
        paragraph.add_run(source_cell.text)
# Needs at module level: from io import BytesIO / from xml.dom import minidom
@staticmethod
def get_image_from_xml(source_document, xml_code):
    """
    Returns the first image found in the *xml_code* as a stream
    :argument
        source_document: [DocxTemplate] document that contains the image
        xml_code: [string] xml code from which to extract the image from
    :return
        image_stream: [BytesIO stream] the image to find
        None if no image exists in the xml_file
    """
    # Parse the xml code for the blip
    xml_parser = minidom.parseString(xml_code)
    items = xml_parser.getElementsByTagName('a:blip')
    # Check if an image exists
    if items:
        # Extract the rId of the image
        rId = items[0].attributes['r:embed'].value
        # Get the blob of the image
        source_document_part = source_document.part
        image_part = source_document_part.related_parts[rId]
        image_bytes = image_part._blob
        # Write the image bytes to a file (or BytesIO stream) and feed it to document.add_picture(), maybe:
        image_stream = BytesIO(image_bytes)
        return image_stream
    # If no image exists
    else:
        return None
To call the method, I used:
# Replace the employer and client logos
self.replace_logos_from_source(self.source_document, self.template_doc, 0, 2) # Employer logo
self.replace_logos_from_source(self.source_document, self.template_doc, 1, 2) # Client logo
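The rId lookup at the heart of get_image_from_xml can be tried out on its own, without a Word file. The sketch below lists every image relationship id in a piece of cell XML rather than only the first one. The helper name blip_rids and the sample XML are made up for illustration; real input would be the string returned by cell._tc.xml.

```python
from xml.dom import minidom


def blip_rids(xml_code):
    """Return the r:embed relationship ids of all images in the XML string."""
    dom = minidom.parseString(xml_code)
    return [
        blip.getAttribute("r:embed")
        for blip in dom.getElementsByTagName("a:blip")
        if blip.hasAttribute("r:embed")
    ]


# Made-up stand-in for the XML of a table cell that holds two pictures.
sample = (
    '<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" '
    'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
    '<a:blip r:embed="rId7"/><a:blip r:embed="rId9"/>'
    "</w:tc>"
)
rids = blip_rids(sample)
```

Each id could then be looked up in source_document.part.related_parts, as in the method above, to get the matching image part.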
