Merge Existing PDF into new ReportLab PDF via flowables - python

I have a ReportLab SimpleDocTemplate that I return as a dynamic PDF. I am generating its content based on some Django model metadata. Here's my template setup:
buff = StringIO()
doc = SimpleDocTemplate(buff, pagesize=letter,
                        rightMargin=72, leftMargin=72,
                        topMargin=72, bottomMargin=18)
Story = []
I can easily add textual metadata from the Entry model into the Story list to be built later:
ptext = '<font size=20>%s</font>' % entry.title.title()
paragraph = Paragraph(ptext, custom_styles["Custom"])
Story.append(paragraph)
And then generate the PDF to be returned in the response by calling build on the SimpleDocTemplate:
doc.build(Story, onFirstPage=entry_page_template, onLaterPages=entry_page_template)
pdf = buff.getvalue()
resp = HttpResponse(mimetype='application/x-download')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
resp.write(pdf)
return resp
One metadata field on the model is a file attachment. When those file attachments are PDFs, I'd like to merge them into the Story that I am generating, i.e. as a ReportLab "flowable".
I'm attempting to do so using pdfrw, but haven't had any luck. Ideally I'd love to just call:
from pdfrw import PdfReader
pdf = PdfReader(entry.document.file.path)
Story.append(pdf)
and append the pdf to the existing Story list to be included in the generation of the final document, as noted above.
Anyone have any ideas? I tried something similar using pagexobj to create the pdf, trying to follow this example:
http://code.google.com/p/pdfrw/source/browse/trunk/examples/rl1/subset.py
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
pdf = pagexobj(PdfReader(entry.document.file.path))
But didn't have any luck either. Can someone explain to me the best way to merge an existing PDF file into a reportlab flowable? I'm no good with this stuff and have been banging my head on pdf-generation for days now. :) Any direction greatly appreciated!

I just had a similar task in a project. I used reportlab (open source version) to generate pdf files and pyPDF to facilitate the merge. My requirements were slightly different in that I just needed one page from each attachment, but I'm sure this is probably close enough for you to get the general idea.
from datetime import datetime
from django.conf import settings
from pyPdf import PdfFileReader, PdfFileWriter
def create_merged_pdf(user):
    basepath = settings.MEDIA_ROOT + "/"
    # following block calls the function that uses reportlab to generate a pdf
    coversheet_path = basepath + "%s_%s_cover_%s.pdf" % (user.first_name, user.last_name, datetime.now().strftime("%f"))
    create_cover_sheet(coversheet_path, user, user.performancereview_set.all())
    # now use the cover sheet and all of the performance reviews to create a merged pdf
    merged_path = basepath + "%s_%s_merged_%s.pdf" % (user.first_name, user.last_name, datetime.now().strftime("%f"))
    # for merged file result
    output = PdfFileWriter()
    # for each pdf file to add, open in a PdfFileReader object and add page to output
    cover_pdf = PdfFileReader(file(coversheet_path, "rb"))
    output.addPage(cover_pdf.getPage(0))
    # iterate through attached files and merge. I only needed the first page, YMMV
    for review in user.performancereview_set.all():
        review_pdf = PdfFileReader(file(review.pdf_file.file.name, "rb"))
        output.addPage(review_pdf.getPage(0)) # only first page of attachment
    # write out the merged file
    outputStream = file(merged_path, "wb")
    output.write(outputStream)
    outputStream.close()

I used the class from "Is there a matplotlib flowable for ReportLab?" to solve my issue. It inserts the PDFs as vector PDF images.
It works great because I needed a table of contents: since the PDF is a flowable object, the built-in TOC functionality worked like a charm.
Note: If you have multiple pages in the file, you have to modify the class slightly. The sample class is designed to just read the first page of the PDF.

I know the question is a bit old but I'd like to provide a new solution using the latest PyPDF2.
You now have access to PdfFileMerger, which can do exactly what you want: append PDFs to an existing file. You can even merge them at different positions and choose a subset or all of the pages!
The official docs are here: https://pythonhosted.org/PyPDF2/PdfFileMerger.html
An example from the code in your question:
import tempfile
import PyPDF2
from django.core.files import File
# Using a temporary file rather than a buffer in memory is probably better
temp_base = tempfile.TemporaryFile()
temp_final = tempfile.TemporaryFile()
# Create document, add what you want to the story, then build
doc = SimpleDocTemplate(temp_base, pagesize=letter, ...)
...
doc.build(...)
# Now, this is the fancy part. Create merger, add extra pages and save
merger = PyPDF2.PdfFileMerger()
merger.append(temp_base)
# Add any extra document, you can choose a subset of pages and add bookmarks
merger.append(entry.document.file, bookmark='Attachment')
merger.write(temp_final)
# Write the final file in the HTTP response
django_file = File(temp_final)
resp = HttpResponse(django_file, content_type='application/pdf')
resp['Content-Disposition'] = 'attachment;filename=logbook.pdf'
if django_file.size is not None:
    resp['Content-Length'] = django_file.size
return resp

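Use this custom flowable:
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
from reportlab.platypus import Flowable, PageBreak
class PDF_Flowable(Flowable):
    def __init__(self, P, page_no, x=0, y=0):
        Flowable.__init__(self)
        self.P = P
        self.page_no = page_no
        # offset at which the page is drawn, relative to the flowable's origin
        self.x = x
        self.y = y
    def draw(self):
        """
        draw the selected page of the existing PDF
        """
        canv = self.canv
        pages = self.P
        page_no = self.page_no
        # save the graphics state so the translation does not leak into later flowables
        canv.saveState()
        canv.translate(self.x, self.y)
        canv.doForm(makerl(canv, pages[page_no]))
        canv.restoreState()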
Then, after opening the existing PDF, wrap each page in the flowable and add it to elements[]:
pages = PdfReader(BASE_DIR + "/out3.pdf").pages
pages = [pagexobj(x) for x in pages]
for i in range(0, len(pages)):
    F = PDF_Flowable(pages, i)
    elements.append(F)
    elements.append(PageBreak())

Related

How to extract the title, authors, creation date of a PDF in Python

I manage papers locally and rename each PDF file to the form "creationdate_authors_title.pdf". I therefore need to extract the title, authors, and creation date of each paper from the PDF file automatically.
I have written a Python script that uses the pdfminer package to extract this info. However, for certain files, the info dictionary that PDFDocument exposes as doc.info[0] may lack keys such as "Author", or the values of these keys are empty.
How can I locate the required info, such as the paper's title, directly in the PDF file using a function like "extract_pages"? Or, more generally, how can I extract the info I need accurately and efficiently?
Any hint would be appreciated! Many thanks in advance.
You can use this script to extract all the metadata using the PyPDF2 library:
from PyPDF2 import PdfFileReader
def get_info(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
        print(info)
        author = info.author
        creator = info.creator
        producer = info.producer
        subject = info.subject
        title = info.title
if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    get_info(path)
As you can see, the info variable holds everything you need. Check the PyPDF2 documentation for details.
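Since the question is specifically about files whose metadata lacks keys or has empty values, here is a minimal sketch of building the target file name with fallbacks. It assumes the metadata has already been copied into a plain dict with raw PDF keys such as "/Author", "/Title" and "/CreationDate"; the helper names `parse_pdf_date` and `build_filename` are hypothetical and not part of PyPDF2 or pdfminer.

```python
import re
from datetime import datetime


def parse_pdf_date(raw):
    """Parse a PDF date string such as "D:20190315093000+01'00'"; return None on failure."""
    if not raw:
        return None
    match = re.match(r"(?:D:)?(\d{4})(\d{2})?(\d{2})?", str(raw))
    if not match:
        return None
    year, month, day = match.groups()
    try:
        # Month and day are optional in PDF dates; default them to 1.
        return datetime(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return None


def build_filename(info, fallback="unknown"):
    """Build "creationdate_authors_title.pdf", tolerating missing or empty entries."""
    created = parse_pdf_date(info.get("/CreationDate"))
    parts = [created.strftime("%Y%m%d") if created else fallback]
    for key in ("/Author", "/Title"):
        value = (info.get(key) or "").strip()
        # Collapse characters that are awkward in file names into single dashes.
        value = re.sub(r"[^\w\-]+", "-", value).strip("-")
        parts.append(value or fallback)
    return "_".join(parts) + ".pdf"
```

Files that come back with "unknown" parts are the ones where the metadata is not enough and the title would have to be recovered from the page content instead.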

PDFs written by PyPDF2 showing changes when opened in Acrobat

I'm using Python and PyPDF2 to generate a set of PDFs based on a template with form fields. The PDFs are created and all of the fields are filled correctly, but when I open the PDFs in Adobe Acrobat they show changes made to the file (i.e., the "Save" menu option is enabled, and when I try to close the file Adobe asks if I want to save changes, even if I haven't touched anything).
It's mostly just a slight annoyance, but is there a way to prevent this from happening? From my research it seems like this means (1) there's JavaScript modifying the file (there isn't), or (2) the file is corrupted and Adobe is fixing it.
A simplified version of my code is below. I set /NeedAppearances to True in both the reader and writer because otherwise the values didn't appear in the PDF unless I clicked on the field. I also set the annotations so that the fields are read-only and appear as regular text.
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject, NumberObject
data = {'field1': 'Text1', 'field2': 'Text2'}
with open('template.pdf', 'rb') as read_file:
    pdf_reader = PdfFileReader(read_file)
    pdf_writer = PdfFileWriter()
    # Set /NeedAppearances to make field values visible
    try:
        if '/AcroForm' in pdf_reader.trailer['/Root']:
            pdf_reader.trailer['/Root']['/AcroForm'][NameObject('/NeedAppearances')] = BooleanObject(True)
        if '/AcroForm' not in pdf_writer._root_object:
            root = pdf_writer._root_object
            acroform = {NameObject('/AcroForm'): IndirectObject(len(pdf_writer._objects), 0, pdf_writer)}
            root.update(acroform)
            root['/AcroForm'][NameObject('/NeedAppearances')] = BooleanObject(True)
    except:
        print('Warning: Error setting PDF /NeedAppearances value.')
    # Add first page to writer
    pdf_writer.addPage(pdf_reader.getPage(0))
    page = pdf_writer.getPage(0)
    # Update form fields
    pdf_writer.updatePageFormFieldValues(page, data)
    # Make fields read-only
    for i in range(len(page['/Annots'])):
        annot = page['/Annots'][i].getObject()
        annot.update({NameObject('/Ff'): NumberObject(1)})
    # Write PDF
    with open('result.pdf', 'wb') as write_file:
        pdf_writer.write(write_file)

How do I change hyperlinks inside a PDF using Python?

How do I change the hyperlinks in a PDF using Python? I am currently using PyPDF2 to open the file and loop through the pages. How do I actually scan for hyperlinks and then change them?
So I couldn't get what you want using the pyPDF2 library.
I did however get something working with another library: pdfrw. This installed fine for me using pip in Python 3.6:
pip install pdfrw
Note: for the following I have been using this example pdf I found online which contains multiple links. Your mileage may vary with this.
import pdfrw
pdf = pdfrw.PdfReader("pdf.pdf") # Load the pdf
new_pdf = pdfrw.PdfWriter() # Create an empty pdf
for page in pdf.pages: # Go through the pages
    # Links are in Annots, but some pages don't have links so Annots returns None
    for annot in page.Annots or []:
        old_url = annot.A.URI
        # >Here you put logic for replacing the URLs<
        # Use the PdfString object to do the encoding for us
        # Note the brackets around the URL here
        new_url = pdfrw.objects.pdfstring.PdfString("(http://www.google.com)")
        # Override the URL with ours
        annot.A.URI = new_url
    new_pdf.addpage(page)
new_pdf.write("new.pdf")
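The placeholder comment "Here you put logic for replacing the URLs" could be filled in many ways. As one possible sketch, the function below swaps one host name for another and leaves every other link untouched. The name `replace_host` is hypothetical, and the code assumes the URI arrives as a simple parenthesized literal such as "(http://example.com/page)" with no escape sequences inside.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_host(raw_uri, old_host, new_host):
    """Return raw_uri with old_host swapped for new_host, keeping any wrapping parentheses."""
    wrapped = raw_uri.startswith("(") and raw_uri.endswith(")")
    url = raw_uri[1:-1] if wrapped else raw_uri
    parts = urlsplit(url)
    if parts.hostname == old_host:
        # Rebuilding netloc this way drops any user info or port from the original URL.
        url = urlunsplit(parts._replace(netloc=new_host))
    return "({})".format(url) if wrapped else url
```

The returned string keeps the surrounding parentheses, so it has the same shape as the literal that the snippet above passes to PdfString.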
I managed to get it working with PyPDF2.
If you just want to remove all annotations for a page, you just have to do:
if '/Annots' in page: del page['/Annots']
Else, here is how you change each link:
import PyPDF2
new_link = "https://www.youtube.com/watch?v=dQw4w9WgXcQ" # great video by the way
pdf_reader = PyPDF2.PdfFileReader("input.pdf")
pdf_writer = PyPDF2.PdfFileWriter()
for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    # pages without annotations are still copied to the output unchanged
    if '/Annots' in page:
        for annot in page['/Annots']:
            annot_obj = annot.getObject()
            if '/A' not in annot_obj: continue # not a link
            # you have to wrap the key and value with a TextStringObject:
            key = PyPDF2.generic.TextStringObject("/URI")
            value = PyPDF2.generic.TextStringObject(new_link)
            annot_obj['/A'][key] = value
    pdf_writer.addPage(page)
with open('output.pdf', 'wb') as f:
    pdf_writer.write(f)
An equivalent one-liner for a given page index i and annotation index j would be:
pdf_reader.getPage(i)['/Annots'][j].getObject()['/A'][PyPDF2.generic.TextStringObject("/URI")] = PyPDF2.generic.TextStringObject(new_link)

How to extract text from a directory of PDF files efficiently with OCR?

I have a large directory of PDF files (images). How can I efficiently extract the text from all the files inside the directory? So far I have tried:
import multiprocessing
import textract
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))
However, it is not working: it takes a lot of time (some of my documents have 600 pages). Additionally: a) I do not know how to efficiently handle the directory transformation part. b) I would like to add a page separator, let's say: <start/page = 1> ... page content ... <end/page = 1>, but I have no idea how to do this.
Thus, how can I apply the extract_txt function to all the files in a directory that end with .pdf, write the results to another directory in .txt format, and add a page separator with OCR text extraction?
Also, I was curious about using Google Docs for this task. Is it possible to use Google Docs programmatically to solve the aforementioned text extraction problem?
UPDATE
Regarding the "adding a page separator" issue (<start/page = 1> ... page content ... <end/page = 1>), after reading Roland Smith's answer I tried:
import os
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile: # I presume you need `wb`.
            w.write(outfile)
        print('\n<begin page pos =' , i, '>\n')
        text = textract.process(str(outfname), method='tesseract')
        os.remove(outfname) # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =' , i, '>\n')
extract_text('/Users/user/Downloads/ImageOnly.pdf')
However, I still have issues with the print() part: instead of printing, it would be more useful to save all the output into a file. Thus, I tried to redirect the output to a file:
sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname) # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()
Any idea of how to make the page extraction/separator trick and saving everything into a file?...
In your code, you are extracting the text, but you don't do anything with it.
Try something like this:
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt' # assuming filenames end with '.pdf'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path
This writes the text to a file that has the same name but a .txt extension.
It also returns the path of the original file to let the parent know that this file is done.
So I would change the mapping code to:
p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
    print('completed file:', fn)
You don't need to give an argument when creating a Pool. By default it will create as many workers as there are cpu-cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returned the filename, you can print it to let the user know that this file is done.
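The question also asks how to handle a whole directory and write the results somewhere else. One way to sketch that part, using only the standard library, is to pair each input PDF with its output path first and then hand the pairs to the pool. The function name `plan_outputs` and the directory arguments are hypothetical; note that the "*.pdf" pattern is case-sensitive on most non-Windows systems.

```python
from pathlib import Path


def plan_outputs(src_dir, dst_dir):
    """Pair every *.pdf in src_dir with a .txt path of the same stem in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    # sorted() keeps the order stable across runs and platforms.
    return [(pdf, dst / (pdf.stem + ".txt")) for pdf in sorted(src.glob("*.pdf"))]
```

A worker that takes one such (pdf, txt) pair, runs the OCR on the first path and writes to the second can then be passed to imap_unordered in place of the single-file list used above.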
Edit 1:
The additional question is if it is possible to mark page boundaries. I think it is.
A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
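As a rough sketch of the first method, the snippet below calls pdfinfo and pdfseparate through subprocess. It assumes both tools from poppler-utils are installed and on the PATH; the function names and the output pattern are hypothetical. pdfinfo prints a "Pages: N" line, and pdfseparate replaces %d in the pattern with the page number.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract N from the "Pages: N" line of pdfinfo's text output, or return None."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def split_into_pages(pdf_path, pattern="page-%d.pdf"):
    """Write one single-page PDF per page using pdfseparate; return the page count."""
    info = subprocess.run(["pdfinfo", pdf_path], check=True,
                          capture_output=True, text=True).stdout
    subprocess.run(["pdfseparate", pdf_path, pattern], check=True)
    return parse_page_count(info)
```

Each resulting single-page file can then be OCR-ed on its own, which gives the page boundaries for free.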
Edit 2:
If you need a file, write a file:
import os
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    outfname = pdf_file[:-4] + '.txt' # Assuming PDF file name ends with ".pdf"
    with open(outfname, 'w', encoding='utf8') as textfile:
        for i in range(inputpdf.numPages):
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            pagefname = 'page{:03d}.pdf'.format(i)
            with open(pagefname, 'wb') as outfile: # I presume you need `wb`.
                w.write(outfile)
            print('page', i)
            # textract returns bytes; decode before combining with str.
            text = textract.process(pagefname, method='tesseract').decode('utf8')
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the output file.
            textfile.write(text)
            os.remove(pagefname) # clean up.
            print(text)

How can I set the PDF version with PyPDF2?

I'm using PyPDF2 1.4 and Python 2.7:
How can I change the PDF version of a file?
What I tried
my_input_filename.pdf is PDF version 1.5, but _my_output_filename.pdf is a 1.3 PDF, I want to keep 1.5 in the output:
from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import NameObject, createStringObject
input_filename = 'my_input_filename.pdf'
# Read input PDF file
inputPDF = PdfFileReader(open(input_filename, 'rb'))
info = inputPDF.documentInfo
for i in xrange(inputPDF.numPages):
    # Create output PDF
    outputPDF = PdfFileWriter()
    # Create dictionary for output PDF
    infoDict = outputPDF._info.getObject()
    # Update output PDF metadata with input PDF metadata
    for key in info:
        infoDict.update({NameObject(key): createStringObject(info[key])})
    outputPDF.addPage(inputPDF.getPage(i))
    with open(output_filename , 'wb') as outputStream:
        outputPDF.write(outputStream)
PyPDF2 in its current versions can't produce anything but files with a PDF-1.3 header; from the official source code:
class PdfFileWriter(object):
    """
    This class supports writing PDF files out, given pages produced by another
    class (typically :class:`PdfFileReader<PdfFileReader>`).
    """
    def __init__(self):
        self._header = b_("%PDF-1.3")
        ...
Whether that is legal is questionable, considering that it lets you feed in features newer than 1.3.
If you just want to fix the version string in the header (I don't know what consequences that would have, so I assume you know more about the PDF standard than I do!), you could try:
from PyPDF2.utils import b_
...
# bytes.replace() returns a new object, so the result has to be assigned back
outputPDF._header = outputPDF._header.replace(b_("PDF-1.3"), b_("PDF-1.5"))
or something of the like.
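To check what actually ended up in the header of a written file, the first bytes can be read directly. This is a small standard-library sketch; the function name `pdf_header_version` is hypothetical. Keep in mind that viewers may report a different version than the header, since from PDF 1.4 onward a /Version entry in the document catalog can override it.

```python
import re


def pdf_header_version(path):
    """Return the version string from a file's "%PDF-x.y" header, or None if absent."""
    with open(path, "rb") as fh:
        head = fh.read(1024)
    match = re.search(rb"%PDF-(\d+\.\d+)", head)
    return match.group(1).decode("ascii") if match else None
```

Calling it on the input and the output file shows whether the header change took effect.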
Going to add to Marcus' answer above:
There's currently nothing stopping you from specifying the version in the metadata using the standard PyPDF2 addMetadata function (I can't speak for when Marcus wrote his post). The example below uses PdfFileMerger (as I've recently been doing some cleanup of PDF metadata on existing files), but PdfFileWriter has the same function:
from PyPDF2 import PdfFileMerger
# Define file input/output, and metadata containing version string.
# Using separate input/output files, since it's worth keeping a copy of the originals!
fileIn = 'foo.pdf'
fileOut = 'bar.pdf'
metadata = {
    u'/Version': 'PDF-1.5'
}
# Set up PDF file merger, copy existing file contents into merger object.
merger = PdfFileMerger()
with open(fileIn, 'rb') as fh_in:
    merger.append(fh_in)
# Append metadata to PDF content in merger.
merger.addMetadata(metadata)
# Write new PDF file with appended metadata to output
# CAUTION: This will overwrite any existing files without prompt!
with open(fileOut, 'wb') as fh_out:
    merger.write(fh_out)
If I open the output of a file I ran through PyPDF2, it shows version 1.7 8x; changing 1.3 to 1.5 or whatever doesn't make a difference.
