PDF - Fitz makes the merged page as flipped - python

Below is the code I use to add a watermark to PDF pages. On some pages the watermark comes out flipped upside down (rotated 180 degrees and mirrored).
doc_report = fitz.open(report_pdf_path)
doc_watermark = fitz.open(watermark_pdf_path)
for i in xrange(doc_report.pageCount):
    page = doc_report.loadPage(i)
    page_front = fitz.open()
    page_front.insertPDF(doc_watermark, from_page=i, to_page=i)
    page.showPDFpage(page.rect, page_front, pno=0, keep_proportion=True, overlay=True, rotate=0, clip=None)
doc_report.save(save_path, encryption=fitz.PDF_ENCRYPT_KEEP)
doc_report.close()
doc_watermark.close()
While debugging, I compared the rotation and transformation properties of the target page and the watermark page; they look identical.
Could you please advise how I can resolve this?

Thanks to K J, here is the updated code resolving the issue.
doc_report = fitz.open(report_pdf_path)
doc_watermark = fitz.open(watermark_pdf_path)
for i in xrange(doc_report.pageCount):
    page = doc_report.loadPage(i)
    page_front = fitz.open()
    # added this
    if not page._isWrapped:
        page._wrapContents()
    page_front.insertPDF(doc_watermark, from_page=i, to_page=i)
    page.showPDFpage(page.rect, page_front, pno=0, keep_proportion=True, overlay=True, rotate=0, clip=None)
doc_report.save(save_path, encryption=fitz.PDF_ENCRYPT_KEEP)
doc_report.close()
doc_watermark.close()
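One plausible explanation for why wrapping helps, offered as a hedged sketch and not as a description of PyMuPDF internals: if a page's existing content stream changes the coordinate system (for example with a vertical flip) and never restores it, anything appended afterwards inherits the flipped system. Wrapping the old content in a save/restore pair (the PDF operators q and Q) isolates that change. The toy interpreter below is plain Python, not fitz; the operation tuples and function names are hypothetical.

```python
# Toy model of a PDF graphics state; matrices are (a, b, c, d, e, f).
IDENTITY = (1, 0, 0, 1, 0, 0)

def concat(m, ctm):
    """Return m x ctm, which is how the PDF 'cm' operator updates the CTM."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)

def run(ops):
    """Interpret q / Q / cm / draw operations and return the drawn points."""
    ctm, stack, drawn = IDENTITY, [], []
    for op in ops:
        if op[0] == "q":        # save the graphics state
            stack.append(ctm)
        elif op[0] == "Q":      # restore the graphics state
            ctm = stack.pop()
        elif op[0] == "cm":     # modify the coordinate system
            ctm = concat(op[1], ctm)
        elif op[0] == "draw":   # place a point using the current CTM
            x, y = op[1]
            a, b, c, d, e, f = ctm
            drawn.append((a * x + c * y + e, b * x + d * y + f))
    return drawn

flip = ("cm", (1, 0, 0, -1, 0, 792))   # vertical flip on a 792pt-high page
watermark = ("draw", (100, 700))       # overlay appended after the old content

unwrapped = run([flip, watermark])                    # overlay inherits the flip
wrapped = run([("q",), flip, ("Q",), watermark])      # flip is isolated by q/Q
print(unwrapped, wrapped)
```

In the unwrapped case the watermark point lands near the bottom of the page instead of near the top, which matches the "upside down" symptom described in the question.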

Related

How Can I Set a Table Cell to Autofit in Word with Python

I have written a Python program that takes images picked by the user and inserts them into a Word document. For each image, the program will create a 1x2 table. The top cell will hold the image and the bottom cell the image name.
My issue is that when I run the program, the created tables don't autofit the images. The image is left at full scale, which cuts off most of it (see image below).
[Image: What my program does]
If you just run Word on its own, create a 1x2 table and insert an image, it automatically scales the image so that the whole image fits. This is what I would like my program to do (see image below).
[Image: What I want my program to do]
I am using the python-docx library. The libraries used are shown below.
[Image: Libraries]
I found this article:
https://python-docx.readthedocs.io/en/latest/dev/analysis/features/table/table-props.html
and tried "table.allow_autofit = True", but this did not work. See the code below.
[Image: Create Table]
I can manually set the image size, as seen commented out in my code, but I would like to not have to do this.
EDIT-
Below is the entire function that I need help with. Please let me know if you need any more. I didn't post all the code because it's a little long. I am using tkinter to ask the user to enter a title and select the files they would like to import.
def auto_gen():
    doc = docx.Document()
    section = doc.sections[0]
    header = section.header
    header_para = header.paragraphs[0]
    header_para.style = doc.styles.add_style('Style Name', WD_STYLE_TYPE.PARAGRAPH)
    font = header_para.style.font
    font.size = Pt(20)
    font.bold = True
    header_para.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
    header_para.text = eid.get()
    f = filedialog.askopenfilenames()
    for x in f:
        caption = os.path.basename(x)
        f_name, f_ext = os.path.splitext(caption)
        table = doc.add_table(rows=2, cols=1)
        table.style = 'Table Grid'
        run1 = table.cell(0,0).paragraphs[0].add_run()
        run2p = table.cell(1,0).paragraphs[0]
        run2 = table.cell(1,0).paragraphs[0].add_run()
        run1.add_picture(x)#, height = Inches(3.38), width = Inches(6))
        run2.text = (f_name)
        run2p.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
        doc.add_paragraph()
    output = eid.get() + '.docx'
    doc.save(output)
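One possible way to avoid hard-coding the size is to compute it from the available width. The helper below is a minimal sketch in plain Python; the name fit_to_width and its arguments are hypothetical, and the units only need to be consistent (python-docx works in EMU). It shrinks an image to a maximum width while keeping the aspect ratio, and leaves smaller images untouched.

```python
def fit_to_width(img_w, img_h, max_w):
    """Return (width, height) scaled down to fit max_w, keeping the aspect ratio.

    Images that are already narrow enough are returned unchanged.
    """
    if img_w <= max_w:
        return img_w, img_h
    scale = max_w / img_w
    return max_w, round(img_h * scale)

# A 1200x800 image squeezed into a 600-wide cell, and a small one left alone.
print(fit_to_width(1200, 800, 600))
print(fit_to_width(300, 200, 600))
```

With python-docx, passing only width to add_picture scales the height proportionally, so something like run1.add_picture(x, width=section.page_width - section.left_margin - section.right_margin) may already be enough. That width is an assumption about the usable area; a table cell has some internal padding, so the real limit is probably slightly smaller.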

Is there a way to resize all the pages of a PDF to one size in Python?

Essentially, I'm looking to resize all of the PDF pages in a document to be the same size as the first page (or any set dimensions, e.g. A4). Pages of different sizes are causing issues when mapping coordinates in a frontend UI I am developing. The result I am hoping for is that if, for example, I have a PDF document with a landscape page, it will be mapped onto an A4 page and take up half of the new page. Could anyone point me to any resources or code that might help me do this kind of thing?
Disclaimer: I am the author of borb, the library used in this answer.
Second disclaimer: it's doable, but not easy.
You can use borb to read the PDF. That is the easy part.
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
def main():
    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle)
    # check whether we have read a Document
    assert doc is not None
if __name__ == "__main__":
    main()
Now that you have a representation of the Document, you need to obtain the size of the first Page.
pi: PageInfo = doc.get_page(0).get_page_info()
w: Decimal = pi.get_width() or Decimal(0)
h: Decimal = pi.get_height() or Decimal(0)
Now, in every Page (except the first one) you need to update the content stream. The content stream is a sequence of PostScript-like operators that actually render the content in the PDF.
Luckily for you, there is an operator that changes the entire coordinate system of the Page you are working on. This concept is called the transformation matrix.
Every operation first has its x/y coordinates changed by applying this 3x3 transformation matrix.
So, by modifying that matrix you are able to scale/translate/rotate all the content inside the Page.
The matrix has this form:
[[ a b 0 ]
[ c d 0 ]
[ e f 1 ]]
The third column is always [0 0 1], so it is not needed.
The cm operator takes 6 arguments (the remaining values) and concatenates the corresponding matrix onto the current transformation matrix.
So you'd need to do something like this:
content_stream = page["Contents"]
# a b c d e f are placeholders for the six matrix values
instructions: bytes = b"a b c d e f cm\n" + content_stream["DecodedBytes"]
content_stream[Name("DecodedBytes")] = instructions
content_stream[Name("Bytes")] = zlib.compress(content_stream["DecodedBytes"], 9)
content_stream[Name("Length")] = bDecimal(len(content_stream["Bytes"]))
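As a rough illustration of what the six values could be, here is a sketch in plain Python (no borb calls) that computes a uniform scale plus a centering translation to fit one page size inside another, and formats the result with the `cm` operator, the PDF operator that concatenates a matrix onto the current transformation matrix. The function names are hypothetical, and centering the content is an assumption; the question only asks that the content fit.

```python
def fit_matrix(src_w, src_h, dst_w, dst_h):
    """Return (a, b, c, d, e, f) scaling a src page to fit, centered, in dst."""
    scale = min(dst_w / src_w, dst_h / src_h)   # uniform scale, no distortion
    tx = (dst_w - src_w * scale) / 2            # horizontal centering offset
    ty = (dst_h - src_h * scale) / 2            # vertical centering offset
    return (scale, 0.0, 0.0, scale, tx, ty)

def cm_operator(matrix):
    """Format six matrix values as a content-stream line ending in 'cm'."""
    return (" ".join(f"{v:.4f}" for v in matrix) + " cm\n").encode("latin1")

# A landscape A4 page (842 x 595 pt) placed on a portrait A4 page (595 x 842 pt).
matrix = fit_matrix(842, 595, 595, 842)
print(cm_operator(matrix))
```

The resulting bytes would take the place of the "a b c d e f" placeholder in front of the decoded content stream. The page's MediaBox would presumably also have to be set to the target size, which this sketch does not cover.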

Python PDF Parsing with Camelot and Extract the Table Title

Camelot is a fantastic Python library to extract the tables from a pdf file as a data frame. However, I'm looking for a solution that also returns the table description text written right above the table.
The code I'm using for extracting tables from pdf is this:
import camelot
tables = camelot.read_pdf('test.pdf', pages='all',lattice=True, suppress_stdout = True)
I'd like to extract the text written above the table, i.e. THE PARTICULARS, as shown in the image below.
What would be the best approach to do this? I'd appreciate any help. Thank you.
You can create the Lattice parser directly
parser = Lattice(**kwargs)
for p in pages:
    t = parser.extract_tables(p, suppress_stdout=suppress_stdout,
                              layout_kwargs=layout_kwargs)
    tables.extend(t)
Then you have access to parser.layout which contains all the components in the page. These components all have bbox (x0, y0, x1, y1) and the extracted tables also have a bbox object. You can find the closest component to the table on top of it and extract the text.
Here's my hilariously bad implementation just so that someone can laugh and get inspired to do a better one and contribute to the great camelot package :)
Caveats:
Will only work for non-rotated tables
It's a heuristic
The code is bad
# Helper methods for _bbox
def top_mid(bbox):
    return ((bbox[0]+bbox[2])/2, bbox[3])
def bottom_mid(bbox):
    return ((bbox[0]+bbox[2])/2, bbox[1])
def distance(p1, p2):
    return math.sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
def get_closest_text(table, htext_objs):
    min_distance = 999 # Cause 9's are big :)
    best_guess = None
    table_mid = top_mid(table._bbox) # Middle of the TOP of the table
    for obj in htext_objs:
        text_mid = bottom_mid(obj.bbox) # Middle of the BOTTOM of the text
        d = distance(text_mid, table_mid)
        if d < min_distance:
            best_guess = obj.get_text().strip()
            min_distance = d
    return best_guess
def get_tables_and_titles(pdf_filename):
    """Here's my hacky code for grabbing tables and guessing at their titles"""
    my_handler = PDFHandler(pdf_filename) # from camelot.handlers import PDFHandler
    tables = camelot.read_pdf(pdf_filename, pages='2,3,4')
    print('Extracting {:d} tables...'.format(tables.n))
    titles = []
    with camelot.utils.TemporaryDirectory() as tempdir:
        for table in tables:
            my_handler._save_page(pdf_filename, table.page, tempdir)
            tmp_file_path = os.path.join(tempdir, f'page-{table.page}.pdf')
            layout, dim = camelot.utils.get_page_layout(tmp_file_path)
            htext_objs = camelot.utils.get_text_objects(layout, ltype="horizontal_text")
            titles.append(get_closest_text(table, htext_objs)) # Might be None
    return titles, tables
See: https://github.com/atlanhq/camelot/issues/395
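A small variation on the heuristic above, sketched with plain tuples instead of camelot objects so it can be tried in isolation. It only considers text whose bottom edge sits at or above the top edge of the table (PDF coordinates grow upwards), which keeps cell text and footers from winning. The function name and the (bbox, text) input format are assumptions made for illustration.

```python
import math

def closest_text_above(table_bbox, text_items):
    """Pick the text nearest to the top-middle of a table, looking upwards only.

    table_bbox: (x0, y0, x1, y1); text_items: iterable of (bbox, text) pairs.
    Returns the stripped text, or None if nothing lies above the table.
    """
    table_x = (table_bbox[0] + table_bbox[2]) / 2
    table_top = table_bbox[3]
    best, best_d = None, math.inf
    for bbox, text in text_items:
        if bbox[1] < table_top:          # bottom of the text is below the table top
            continue
        text_x = (bbox[0] + bbox[2]) / 2
        d = math.hypot(text_x - table_x, bbox[1] - table_top)
        if d < best_d:
            best, best_d = text.strip(), d
    return best

items = [
    ((50, 410, 300, 425), "THE PARTICULARS\n"),   # just above the table
    ((50, 700, 550, 720), "Page header"),         # far above the table
    ((60, 380, 200, 395), "cell text"),           # inside the table
]
title = closest_text_above((50, 100, 550, 400), items)
print(title)
```

Like the original, this is still a heuristic and will only make sense for non-rotated tables.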

Python-pptx: copy slide

How can I copy a slide?
I created a template slide and I need to copy it and edit the shapes of each copy separately.
Or how can I add my template slide to presentation.slide_layouts?
This is what I found on GitHub, and it works for me. I did change a couple of things for my project. You will need to import six and copy. I am using pptx-6.10
def duplicate_slide(pres, index):
    template = pres.slides[index]
    try:
        blank_slide_layout = pres.slide_layouts[12]
    except IndexError:
        # fall back to the last available layout
        blank_slide_layout = pres.slide_layouts[len(pres.slide_layouts) - 1]
    copied_slide = pres.slides.add_slide(blank_slide_layout)
    for shp in template.shapes:
        el = shp.element
        newel = copy.deepcopy(el)
        copied_slide.shapes._spTree.insert_element_before(newel, 'p:extLst')
    for _, value in six.iteritems(template.part.rels):
        # Make sure we don't copy a notesSlide relation as that won't exist
        if "notesSlide" not in value.reltype:
            copied_slide.part.rels.add_relationship(
                value.reltype,
                value._target,
                value.rId
            )
    return copied_slide
Then you can create the copy by passing in your presentation and the slide index of your template:
copied_slide = duplicate_slide(pres, 4)
I am still working on editing the shapes from the copied slide; once I am further along in my project I can post an update.
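To see why the shape elements are deep-copied before being inserted, here is a small stand-in using the standard library's xml.etree.ElementTree in place of the lxml elements that python-pptx works with. The element names are simplified and made up for the example. The point is only that a deep copy is an independent element, so editing the copy leaves the template untouched.

```python
import copy
import xml.etree.ElementTree as ET

# Simplified stand-ins for a template slide's shape tree and an empty target.
template_tree = ET.fromstring("<spTree><sp name='title'><t>Template</t></sp></spTree>")
target_tree = ET.fromstring("<spTree/>")

# Copy every shape element into the target tree.
for shape_el in template_tree:
    target_tree.append(copy.deepcopy(shape_el))

# Editing the copy does not affect the template.
target_tree[0].find("t").text = "Copy 1"
print(template_tree[0].find("t").text, "|", target_tree[0].find("t").text)
```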
I wanted to present my workaround for copying slides. I use a template ppt and populate it. Before populating, I already know which slides of the template need to be copied and how often. So I first copy the slides and save a new ppt that contains the copies. After saving, I can open that ppt and use pptx to populate the slides.
import win32com.client
ppt_instance = win32com.client.Dispatch('PowerPoint.Application')
#open the powerpoint presentation headless in background
read_only = True
has_title = False
window = False
prs = ppt_instance.Presentations.open('path/ppt.pptx',read_only,has_title,window)
nr_slide = 1
insert_index = 1
prs.Slides(nr_slide).Copy()
prs.Slides.Paste(Index=insert_index)
prs.SaveAs('path/new_ppt.pptx')
prs.Close()
#kills ppt_instance
ppt_instance.Quit()
del ppt_instance
In this case the first slide of the presentation would be copied and inserted after the first slide of the same presentation.
Hope this helps some of you!
Since I also found another use case for the code shared by @d_bergeron, I just wanted to share it here.
In my case, I wanted to copy a slide from another presentation into the one I generated with python-pptx.
As the argument I pass in the Presentation() object I created using python-pptx (prs = Presentation()).
from pptx import Presentation
import copy
def copy_slide_from_external_prs(prs):
    # copy from external presentation all objects into the existing presentation
    external_pres = Presentation("PATH/TO/PRES/TO/IMPORT/from.pptx")
    # specify the slide you want to copy the contents from
    ext_slide = external_pres.slides[0]
    # Define the layout you want to use from your generated pptx
    SLD_LAYOUT = 5
    slide_layout = prs.slide_layouts[SLD_LAYOUT]
    # create new slide, to copy contents to
    curr_slide = prs.slides.add_slide(slide_layout)
    # now copy contents from external slide, but do not copy slide properties
    # e.g. slide layouts, etc., because these would produce errors, as duplicate
    # entries might be generated
    for shp in ext_slide.shapes:
        el = shp.element
        newel = copy.deepcopy(el)
        curr_slide.shapes._spTree.insert_element_before(newel, 'p:extLst')
    return prs
I am mainly posting it here, since I was looking for a way to copy an external slide into my presentation and ended up in this thread.
I was using n00by0815's answer and it worked great until I had to copy images. Here is my adapted version that handles images. This code creates a local copy of the image then adds it to the slide. I'm sure there's a cleaner way, but this works.
Here is another way to copy each slide onto a single PPTX slide for an entire presentation, and then you can use LibreOffice to convert each individual powerpoint into an image:
def get_slide_count(prs):
    """ Get the number of slides in PPTX presentation """
    slidecount = 0
    for slide in prs.slides:
        slidecount += 1
    return slidecount
def delete_slide(prs, slide):
    """ Delete a slide out of a powerpoint presentation"""
    id_dict = { slide.id: [i, slide.rId] for i,slide in enumerate(prs.slides._sldIdLst) }
    slide_id = slide.slide_id
    prs.part.drop_rel(id_dict[slide_id][1])
    del prs.slides._sldIdLst[id_dict[slide_id][0]]
def get_single_slide_pres(prs, slidetokeep):
    # iterate over a snapshot, since delete_slide mutates prs.slides
    for idx, slide in enumerate(list(prs.slides)):
        if idx < slidetokeep:
            delete_slide(prs, slide)
        elif (idx > slidetokeep):
            delete_slide(prs, slide)
    prs.save(str(slidetokeep + 1) + ".pptx")
pptxfilepath = "test.pptx"
prs = Presentation(pptxfilepath)
slidecount = get_slide_count(prs)
for i in range(slidecount):
    prs_backup = Presentation(pptxfilepath)
    get_single_slide_pres(prs_backup, i)
    prs_backup = None
I edited @n00by0815's solution and came up with the following code, which can also copy images without errors:
# ATTENTION: THE PPTX PACKAGE RUNS ONLY ON CERTAIN VERSIONS OF PYTHON (https://python-pptx.readthedocs.io/en/latest/user/install.html)
from pptx import Presentation
from pptx.util import Pt
from pptx.enum.text import PP_ALIGN
import copy
import os
DIR_PATH = os.path.dirname(os.path.realpath(__file__))
#modeled on https://stackoverflow.com/a/56074651/20159015 and https://stackoverflow.com/a/62921848/20159015
#this for some reason doesnt copy text properties (font size, alignment etc.)
def SlideCopyFromPasteInto(copyFromPres, slideIndex, pasteIntoPres):
    # specify the slide you want to copy the contents from
    slide_to_copy = copyFromPres.slides[slideIndex]
    # Define the layout you want to use from your generated pptx
    slide_layout = pasteIntoPres.slide_layouts.get_by_name("Blank") # names of layouts can be found here under step 3: https://www.geeksforgeeks.org/how-to-change-slide-layout-in-ms-powerpoint/
    # it is important for slide_layout to be blank since you dont want these "Write your title here" or something like that textboxes
    # alternative: slide_layout = pasteIntoPres.slide_layouts[copyFromPres.slide_layouts.index(slide_to_copy.slide_layout)]
    # create new slide, to copy contents to
    new_slide = pasteIntoPres.slides.add_slide(slide_layout)
    # create images dict
    imgDict = {}
    # now copy contents from external slide, but do not copy slide properties
    # e.g. slide layouts, etc., because these would produce errors, as duplicate
    # entries might be generated
    for shp in slide_to_copy.shapes:
        if 'Picture' in shp.name:
            # save image
            with open(shp.name+'.jpg', 'wb') as f:
                f.write(shp.image.blob)
            # add image to dict
            imgDict[shp.name+'.jpg'] = [shp.left, shp.top, shp.width, shp.height]
        else:
            # create copy of elem
            el = shp.element
            newel = copy.deepcopy(el)
            # add elem to shape tree
            new_slide.shapes._spTree.insert_element_before(newel, 'p:extLst')
    # things added first will be covered by things added last => since I want pictures to be in foreground, I will add them after others elements
    # you can change this if you want
    # add pictures
    for k, v in imgDict.items():
        new_slide.shapes.add_picture(k, v[0], v[1], v[2], v[3])
        os.remove(k)
    return new_slide # this returns slide so you can instantly work with it when it is pasted in presentation
templatePres = Presentation(f"{DIR_PATH}/template.pptx")
outputPres = Presentation()
outputPres.slide_height, outputPres.slide_width = templatePres.slide_height, templatePres.slide_width
# this can sometimes cause problems. Alternative:
# outputPres = Presentation(f"{DIR_PATH}/template.pptx") and now delete all slides to have empty presentation
# if you just want to copy and paste slide:
SlideCopyFromPasteInto(templatePres,0,outputPres)
# if you want to edit slide that was just pasted in presentation:
pastedSlide = SlideCopyFromPasteInto(templatePres,0,outputPres)
pastedSlide.shapes.title.text = "My very cool title"
for shape in pastedSlide.shapes:
    if not(shape.has_text_frame): continue
    # easiest ways to edit text fields is to put some identifying text in them
    if shape.text_frame.text == "personName": # there is a text field with "personName" written into it
        shape.text_frame.text = "Brian"
    if shape.text_frame.text == "personSalary":
        shape.text_frame.text = str(brianSalary)
    # stylizing text need to be done after you change it
    shape.text_frame.paragraphs[0].font.size = Pt(80)
    shape.text_frame.paragraphs[0].alignment = PP_ALIGN.CENTER
outputPres.save(f'{DIR_PATH}/output.pptx')
Sorry for the delay, I was moved to another project. I was able to complete my ppt project using multiple template slides and copying them. At the end of building the presentation I delete the templates. To grab the shapes you will need to iterate through the slide.shapes and find the name of the shape that you are looking for. Once you have this returned you can then edit the shape as needed. I have added a version of the add_text function that I use to populate shape.text_frame.
def find_shape_by_name(shapes, name):
    for shape in shapes:
        if shape.name == name:
            return shape
    return None
def add_text(shape, text, alignment=None):
    if alignment:
        shape.vertical_anchor = alignment
    tf = shape.text_frame
    tf.clear()
    run = tf.paragraphs[0].add_run()
    run.text = text if text else ''
To find the shape "slide_title".
slide_title = find_shape_by_name(slide.shapes,'slide_title')
To add text to the shape.
add_text(slide_title,'TEST SLIDE')
Please let me know if you need any other assistance.

Add Header based on Condition

I'm using reportlab to generate a PDF document that contains two types of reports.
Please assume the reports are r1 and r2. Each report may run to more than 2-3 pages, so I want to add header-like text starting from the second page of each report.
For example, on the pages of the r1 report add "r1 report continued..." and on the pages of the r2 report add "r2 report continued...". How can I do that?
Currently I'm creating a list of the elements and passing it to the template's build function, so I cannot identify which report is being processed.
For example...
elements = []
elements.append(r1)
...
.....
elements.append(r2)
doc.build(elements)
I finally managed to resolve it, but I'm not sure if it's a proper method.
A big thanks to grc, who provided the answer from which I created my solution.
As in grc's answer, I created an afterFlowable callback function.
def afterFlowable(self,flowable):
    if hasattr(flowable, 'cReport'):
        cReport = getattr(flowable, 'cReport')
        self.cReport = cReport
Then while adding data for the r1 report a custom attribute will be created
elements.append(PageBreak())
elements[-1].cReport = 'r1'
Same code while adding data for r2 report
elements.append(PageBreak())
elements[-1].cReport = 'r2'
Then in the onPage function of the template
def headerAndFooter(canvas, doc):
    canvas.saveState()
    # afterFlowable stored the current report name on the doc template
    if getattr(doc, 'cReport', None) == 'r1':
        Ph = Paragraph("""<para>r1 Report (continued)</para>""",styleH5)
        w, h = Ph.wrap(doc.width, doc.topMargin)
        Ph.drawOn(canvas, doc.leftMargin, doc.height+doc.topMargin)
    canvas.restoreState()
template = PageTemplate(id='test', frames=frame, onPage=headerAndFooter)
Note that I'm just copying and pasting parts of my code...
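To make the flow of the marker easier to follow, here is a toy simulation in plain Python. It does not use reportlab; TrackingDoc, header_text and simulate are hypothetical stand-ins that only mimic the order of events: a marker attached to a page break is picked up by afterFlowable, and the header of the next page is chosen from whatever was stored last.

```python
from types import SimpleNamespace

class TrackingDoc:
    """Stand-in for a doc template that remembers the current report name."""
    def __init__(self):
        self.cReport = None

    def afterFlowable(self, flowable):
        # Same idea as the callback above: pick up the marker if present.
        if hasattr(flowable, "cReport"):
            self.cReport = flowable.cReport

def header_text(doc):
    """Header for a page that is about to start, or None if no report is set."""
    if doc.cReport is None:
        return None
    return f"{doc.cReport} report continued..."

def simulate(flowables):
    """Return one header per page; a page starts after every page break."""
    doc, headers, new_page = TrackingDoc(), [], True
    for flowable in flowables:
        if new_page:
            headers.append(header_text(doc))
            new_page = False
        doc.afterFlowable(flowable)
        if getattr(flowable, "is_page_break", False):
            new_page = True
    return headers

text = SimpleNamespace()
flowables = [
    text,
    SimpleNamespace(is_page_break=True, cReport="r1"),  # marked break: r1 starts
    text,
    SimpleNamespace(is_page_break=True),                # plain break inside r1
    text,
    SimpleNamespace(is_page_break=True, cReport="r2"),  # marked break: r2 starts
    text,
]
headers = simulate(flowables)
print(headers)
```

In this simplified model the page right after a marked break already gets the header, so a real implementation that wants the text only from the second page of each report may need an extra flag.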
