Keep filled fields in PDF after inserting image - ReportLab and pdfrw

Keep filled fields in PDF after inserting image - ReportLab and pdfrw - python

I have a fillable PDF and I have filled out some of the fields and saved it.
I am able to add an image to the PDF using pdfrw and ReportLab; however, when the PDF is saved, the data entered into the fillable fields has disappeared.
Can anyone point me in the right direction so that I can add an image to a PDF while still maintaining the filled fields?
Here is the reproducible code:
An example PDF is here
from PIL import Image
from reportlab.pdfgen.canvas import Canvas
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
inpfn = 'input.pdf'
outfn = 'output.pdf'
pages = PdfReader(inpfn).pages
page=pagexobj(pages[0])
canvas = Canvas(outfn)
canvas.doForm(makerl(canvas, page))
im = Image.open("photo.jpg")
canvas.drawInlineImage(im, 20, 175, width=100, height=60)
canvas.save()
Here is the output PDF. As you can see the image is there, but the fields are all empty.
I can verify that the data from the fields is present when the PDF is initially read by inspecting the structure of the PDF with:
for p in page:
for a in p['/Annots']:
print(a['/V'])
This prints out a bunch of data (with weird characters) but I can indeed see data that was entered into the form (e.g., "trade_name")

Related

PDFs written by PyPDF2 showing changes when opened in Acrobat

I'm using Python and PyPDF2 to generate a set of PDFs based on a template with form fields. The PDFs are created and all of the fields are filled correctly, but when I open the PDFs in Adobe Acrobat they show changes made to the file (i.e., the "Save" menu option is enabled, and when I try to close the file Adobe asks if I want to save changes, even if I haven't touched anything).
It's mostly just a slight annoyance, but is there a way to prevent this from happening? From my research it seems like this means (1) there's JavaScript modifying the file (there isn't), or (2) the file is corrupted and Adobe is fixing it.
A simplified version of my code is below. I set /NeedAppearances to True in both the reader and writer because otherwise the values didn't appear in the PDF unless I clicked on the field. I also set the annotations so that the fields are read-only and appear as regular text.
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject, NumberObject
data = {'field1': 'Text1', 'field2': 'Text2'}
with open('template.pdf', 'rb') as read_file:
pdf_reader = PdfFileReader(read_file)
pdf_writer = PdfFileWriter()
# Set /NeedAppearances to make field values visible
try:
if '/AcroForm' in pdf_reader.trailer['/Root']:
pdf_reader.trailer['/Root']['/AcroForm'][NameObject('/NeedAppearances')] = BooleanObject(True)
if '/AcroForm' not in pdf_writer._root_object:
root = pdf_writer._root_object
acroform = {NameObject('/AcroForm'): IndirectObject(len(pdf_writer._objects), 0, pdf_writer)}
root.update(acroform)
root['/AcroForm'][NameObject('/NeedAppearances')] = BooleanObject(True)
except:
print('Warning: Error setting PDF /NeedAppearances value.')
# Add first page to writer
pdf_writer.addPage(pdf_reader.getPage(0))
page = pdf_writer.getPage(0)
# Update form fields
pdf_writer.updatePageFormFieldValues(page, data)
# Make fields read-only
for i in range(len(page['/Annots'])):
annot = page['/Annots'][i].getObject()
annot.update({NameObject('/Ff'): NumberObject(1)})
# Write PDF
with open('result.pdf', 'wb') as write_file:
pdf_writer.write(write_file)

Is there a way to extract images from a pdf in Python while preserving the location of the image in the pdf?

I need to extract images from a pdf without losing its location in the pdf. I need to know which page the image is on and where in the text the image is located, and then save the text and images in the pdf to a json file with the sequence of the data intact.

You could use pdfplumber and run this code:
import pdfplumber
pdf_obj = pdfplumber.open(doc_path)
page = pdf_obj.pages[page_no]
images_in_page = page.images
page_height = page.height
image_bbox = (image['x0'], page_height - image['y1'], image['x1'], page_height - image['y0'])
cropped_page = page.crop(image_bbox)
image_obj = cropped_page.to_image(resolution=400)
image_obj.save(path_to_save_image)
Source link here

IPython.display: Align Image with Text

I'm trying to display a base64 encoded image aligned to some HTML text in Jupyter notebooks. The code below is an initial attempt - the image (downloaded from https://stackoverflow.com/company/logos) is show belown the text.
import base64
from io import BytesIO
from IPython.display import Image
from IPython.core.display import display, HTML
with open(r'H:\se-icon.png', 'rb') as f:
data = BytesIO(f.read())
dataEncoded = base64.b64encode(data.getvalue())
display(HTML('<h3>Some Text</h3>'))
display(Image(dataEncoded, format='png', width=100))
The desired output is something like below. Ideally I could use just HTML code, with the img tag, however this does not work for my particular case, as the output of the notebook is emailed on. Use of the img tag does not guarantee that the image is displayed.
Is there any way to display the image aligned to the HTML text?

How to overlay text hyperlink on top of an image using reportlab?

I am trying to create a pdf document with many pages. Each page contains an image on top of which are displayed texts with hyperlinks to other pages.
I have managed to do this using reportlab.pdfgen and rectangle links, unfortunately the reader used (an e-reader) does not recognize this type of links.
Instead of it, I managed to create text embedded hyperlinks (using How to add a link to the word using reportlab?) which are recognizes by my e-reader but I didn't manage to overlay them on an image.
I tried to use the solution provided in this post (reportlab: add background image by using platypus) to use an image as background but it doesn't work. When I set the document's pagesize to the size of the image, the paragraph isn't shown. When I set it to a bigger size than the size of the image, the paragraph is shown above the image not overlaying it.
Here's my code:
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import BaseDocTemplate, PageTemplate, Frame, Paragraph
def draw_static(canvas, doc):
# Save the current settings
canvas.saveState()
# Draw the image
canvas.drawImage(r'path\to\the\image.png', 0, 0, width=500, height=500)
# Restore setting to before function call
canvas.restoreState()
# Set up a basic template
doc = BaseDocTemplate('test.pdf', pagesize=(500, 500))
# Create a Frame for the Flowables (Paragraphs and such)
frame = Frame(doc.leftMargin, doc.bottomMargin, 500, 500, id='normal')
# Add the Frame to the template and tell the template to call draw_static for each page
template = PageTemplate(id='test', frames=[frame], onPage=draw_static)
# Add the template to the doc
doc.addPageTemplates([template])
# All the default stuff for generating a document
styles = getSampleStyleSheet()
story = []
link = '<link href="http://example.com">Text</link>'
P = Paragraph(link, styles["Normal"])
story.append(P)
doc.build(story)
PS: This code is not complete (doesn't create multiple pages, etc ...) but is just a minimal example for trying to overlay text embedded hyperlinks on an image.

I finally found a simpler solution to my problem only using frames from platypus, mixed with pdfgen tools.
Solution:
from reportlab.pdfgen import canvas
from reportlab.platypus import Paragraph, Frame
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch
c = canvas.Canvas("test.pdf")
styles = getSampleStyleSheet()
style = styles['Title']
items = []
link = '<link href="http://example.com">Text</link>'
items.append(Paragraph(link, style))
# Drawing the image
c.drawInlineImage(r'path\to\image', 0, 0)
# Create a Frame for the paragraph
f = Frame(inch, inch, inch, inch, showBoundary=1)
f.addFromList(items, c)
c.save()

Merge Existing PDF into new ReportLab PDF via flowables

I have a reportlab SimpleDocTemplate and returning it as a dynamic PDF. I am generating it's content based on some Django model metadata. Here's my template setup:
buff = StringIO()
doc = SimpleDocTemplate(buff, pagesize=letter,
rightMargin=72,leftMargin=72,
topMargin=72,bottomMargin=18)
Story = []
I can easily add textual metadata from the Entry model into the Story list to be built later:
ptext = '<font size=20>%s</font>' % entry.title.title()
paragraph = Paragraph(ptext, custom_styles["Custom"])
Story.append(paragraph)
And then generate the PDF to be returned in the response by calling build on the SimpleDocTemplate:
doc.build(Story, onFirstPage=entry_page_template, onLaterPages=entry_page_template)
pdf = buff.getvalue()
resp = HttpResponse(mimetype='application/x-download')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
resp.write(pdf)
return resp
One metadata field on the model is a file attachment. When those file attachments are PDFs, I'd like to merge them into the Story that I am generating; IE meaning a PDF of reportlab "flowable" type.
I'm attempting to do so using pdfrw, but haven't had any luck. Ideally I'd love to just call:
from pdfrw import PdfReader
pdf = pPdfReader(entry.document.file.path)
Story.append(pdf)
and append the pdf to the existing Story list to be included in the generation of the final document, as noted above.
Anyone have any ideas? I tried something similar using pagexobj to create the pdf, trying to follow this example:
http://code.google.com/p/pdfrw/source/browse/trunk/examples/rl1/subset.py
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
pdf = pagexobj(PdfReader(entry.document.file.path))
But didn't have any luck either. Can someone explain to me the best way to merge an existing PDF file into a reportlab flowable? I'm no good with this stuff and have been banging my head on pdf-generation for days now. :) Any direction greatly appreciated!

I just had a similar task in a project. I used reportlab (open source version) to generate pdf files and pyPDF to facilitate the merge. My requirements were slightly different in that I just needed one page from each attachment, but I'm sure this is probably close enough for you to get the general idea.
from pyPdf import PdfFileReader, PdfFileWriter
def create_merged_pdf(user):
basepath = settings.MEDIA_ROOT + "/"
# following block calls the function that uses reportlab to generate a pdf
coversheet_path = basepath + "%s_%s_cover_%s.pdf" %(user.first_name, user.last_name, datetime.now().strftime("%f"))
create_cover_sheet(coversheet_path, user, user.performancereview_set.all())
# now user the cover sheet and all of the performance reviews to create a merged pdf
merged_path = basepath + "%s_%s_merged_%s.pdf" %(user.first_name, user.last_name, datetime.now().strftime("%f"))
# for merged file result
output = PdfFileWriter()
# for each pdf file to add, open in a PdfFileReader object and add page to output
cover_pdf = PdfFileReader(file( coversheet_path, "rb"))
output.addPage(cover_pdf.getPage(0))
# iterate through attached files and merge. I only needed the first page, YMMV
for review in user.performancereview_set.all():
review_pdf = PdfFileReader(file(review.pdf_file.file.name, "rb"))
output.addPage(review_pdf.getPage(0)) # only first page of attachment
# write out the merged file
outputStream = file(merged_path, "wb")
output.write(outputStream)
outputStream.close()

I used the following class to solve my issue. It inserts the PDFs as vector PDF images.
It works great because I needed to have a table of contents. The flowable object allowed the built in TOC functionality to work like a charm.
Is there a matplotlib flowable for ReportLab?
Note: If you have multiple pages in the file, you have to modify the class slightly. The sample class is designed to just read the first page of the PDF.

I know the question is a bit old but I'd like to provide a new solution using the latest PyPDF2.
You now have access to the PdfFileMerger, which can do exactly what you want, append PDFs to an existing file. You can even merge them in different positions and choose a subset or all the pages!
The official docs are here: https://pythonhosted.org/PyPDF2/PdfFileMerger.html
An example from the code in your question:
import tempfile
import PyPDF2
from django.core.files import File
# Using a temporary file rather than a buffer in memory is probably better
temp_base = tempfile.TemporaryFile()
temp_final = tempfile.TemporaryFile()
# Create document, add what you want to the story, then build
doc = SimpleDocTemplate(temp_base, pagesize=letter, ...)
...
doc.build(...)
# Now, this is the fancy part. Create merger, add extra pages and save
merger = PyPDF2.PdfFileMerger()
merger.append(temp_base)
# Add any extra document, you can choose a subset of pages and add bookmarks
merger.append(entry.document.file, bookmark='Attachment')
merger.write(temp_final)
# Write the final file in the HTTP response
django_file = File(temp_final)
resp = HttpResponse(django_file, content_type='application/pdf')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
if django_file.size is not None:
resp['Content-Length'] = django_file.size
return resp

Use this custom flowable:
class PDF_Flowable(Flowable):
#----------------------------------------------------------------------
def __init__(self,P,page_no):
Flowable.__init__(self)
self.P = P
self.page_no = page_no
#----------------------------------------------------------------------
def draw(self):
"""
draw the line
"""
canv = self.canv
pages = self.P
page_no = self.page_no
canv.translate(x, y)
canv.doForm(makerl(canv, pages[page_no]))
canv.restoreState()
and then after opening existing pdf i.e.
pages = PdfReader(BASE_DIR + "/out3.pdf").pages
pages = [pagexobj(x) for x in pages]
for i in range(0, len(pages)):
F = PDF_Flowable(pages,i)
elements.append(F)
elements.append(PageBreak())
use this code to add this custom flowable in elements[].

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Keep filled fields in PDF after inserting image - ReportLab and pdfrw - python

Related

PDFs written by PyPDF2 showing changes when opened in Acrobat

Is there a way to extract images from a pdf in Python while preserving the location of the image in the pdf?

IPython.display: Align Image with Text

How to overlay text hyperlink on top of an image using reportlab?

Merge Existing PDF into new ReportLab PDF via flowables

Categories

Resources