PyPDF2 & ReportLab editing a PDF and merging multiple pages - python

I'm trying to add some text (page numbers) to an existing PDF file.
I'm using the PyPDF2 package to iterate through the original file, create a canvas for each page, and then merge the two. My problem is that once the program has finished, the new PDF only contains the last page of the original PDF, not all of the pages.
E.g. if the original PDF has 33 pages, the new PDF only has the last page, but with the correct numbering.
Maybe the code can do a better job of explaining:
import io
import PyPDF2
from reportlab.pdfgen import canvas

def test(location, reference, destination):
    file = open(location, "rb")
    read_pdf = PyPDF2.PdfFileReader(file)
    for i in range (0, read_pdf.getNumPages()):
        page = read_pdf.getPage(i)
        pageReference = "%s_%s"%(reference,format(i+1, '03d'))
        width = getPageSizeW(page)
        height = getPageSizeH(page)
        pagesize = (width, height)
        packet = io.BytesIO()
        can = canvas.Canvas(packet, pagesize = pagesize)
        can.setFillColorRGB(1,0,0)
        can.drawString(height*3.5, height*2.75, pageReference)
        can.save()
        packet.seek(0)
        new_pdf = PyPDF2.PdfFileReader(packet)
        #add new pdf to old pdf
        output = PyPDF2.PdfFileWriter()
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
        outputStream = open(destination, 'wb')
        output.write(outputStream)
        print(pageReference)
    outputStream.close()
    file.close()

def getPageSizeH(p):
    h = float(p.mediaBox.getHeight()) * 0.352
    return h

def getPageSizeW(p):
    w = float(p.mediaBox.getWidth()) * 0.352
    return w
Also if anyone has any ideas on how to insert the references on the top right in a better way, it would be appreciated.

I'm not an expert at PyPDF2, but the only place your function calls PyPDF2.PdfFileWriter() is inside the for loop. I suspect you are creating a new, empty writer on every iteration and adding a single page to it each time, so the pages from earlier iterations are lost and only the last page ends up in the result. Create the writer once before the loop and write it out once after the loop.
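For illustration, here is a minimal sketch of that fix, assuming the same pre-3.0 PyPDF2 API used in the question (PdfFileReader, getPage, mergePage). The helper names page_reference, top_right_anchor and number_pages, and the 36-point margin, are made up for this example. It also tries one possible answer to the top-right question: the mediaBox values are already in points, which is the unit ReportLab works in, so the sketch skips the 0.352 scaling and right-aligns the text with drawRightString. It assumes the page's lower-left corner is at (0, 0).

```python
import io


def page_reference(reference, index):
    """Build labels such as 'REF_001' from a zero-based page index."""
    return f"{reference}_{index + 1:03d}"


def top_right_anchor(width, height, margin=36):
    """Return the point (in PDF points) where right-aligned text should end."""
    return (width - margin, height - margin)


def number_pages(location, reference, destination):
    # Imported here so the two helpers above work without the PDF libraries.
    import PyPDF2
    from reportlab.pdfgen import canvas

    with open(location, "rb") as source:
        reader = PyPDF2.PdfFileReader(source)
        writer = PyPDF2.PdfFileWriter()  # created once, before the loop

        for i in range(reader.getNumPages()):
            page = reader.getPage(i)
            # mediaBox is in points, so no unit conversion is applied.
            width = float(page.mediaBox.getWidth())
            height = float(page.mediaBox.getHeight())

            # One single-page overlay per source page.
            packet = io.BytesIO()
            can = canvas.Canvas(packet, pagesize=(width, height))
            can.setFillColorRGB(1, 0, 0)
            x, y = top_right_anchor(width, height)
            can.drawRightString(x, y, page_reference(reference, i))
            can.save()

            packet.seek(0)
            overlay = PyPDF2.PdfFileReader(packet)
            page.mergePage(overlay.getPage(0))
            writer.addPage(page)  # every page accumulates in the same writer

        # Written once, after the loop, while the source file is still open.
        with open(destination, "wb") as output_stream:
            writer.write(output_stream)
```

Calling number_pages("in.pdf", "REF", "out.pdf") (file names are placeholders) should then produce one output page per input page.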

Related

Cropping PDF files but different dimension for each page

I'm trying to crop a PDF file, but with different dimensions for each page. I have managed to do it with PyPDF2 for the first page, but I'm not sure how to crop the second (and final) page to different dimensions. This is what I have so far, with the dimensions for the first page. What do I need to add?
from PyPDF2 import PdfFileWriter, PdfFileReader
print_on = True
output = PdfFileWriter()
input = PdfFileReader(open('/Users/Downloads/lol.pdf', 'rb'))
n = input.getNumPages()
for i in range(n):
    page = input.getPage(i)
    page.cropBox.upperLeft = (63.389830508474574,643.7972508591065)
    page.cropBox.lowerRight = (561.8644067796611,483.2096219931271)
    output.addPage(page)
outputStream = open('/Users/Downloads/result.pdf', 'wb')
print('Done')
output.write(outputStream)
outputStream.close()
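One possible approach, offered as a sketch and not a tested answer: keep the crop boxes in a lookup keyed by page index and apply whichever entry matches the current page. It assumes the same pre-3.0 PyPDF2 API as the question. The names CROP_BOXES, crop_box_for and crop_pages are invented for this example, and the coordinates for the second page are hypothetical placeholders, since the question does not give them.

```python
# Crop boxes keyed by zero-based page index: (upper_left, lower_right).
CROP_BOXES = {
    0: ((63.389830508474574, 643.7972508591065),
        (561.8644067796611, 483.2096219931271)),
    1: ((50.0, 700.0), (550.0, 400.0)),  # hypothetical values for page 2
}


def crop_box_for(index, boxes=CROP_BOXES):
    """Return the crop box for a page, or None to leave it uncropped."""
    return boxes.get(index)


def crop_pages(source_path, destination_path, boxes=CROP_BOXES):
    # Imported here so crop_box_for works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(source_path, "rb") as source:
        reader = PdfFileReader(source)
        writer = PdfFileWriter()
        for i in range(reader.getNumPages()):
            page = reader.getPage(i)
            box = crop_box_for(i, boxes)
            if box is not None:
                upper_left, lower_right = box
                page.cropBox.upperLeft = upper_left
                page.cropBox.lowerRight = lower_right
            writer.addPage(page)
        with open(destination_path, "wb") as output_stream:
            writer.write(output_stream)
```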

Separating large PDF document into smaller documents based on content

I have a large pdf file with very specific formatting, a bunch of reports if you will, all in one big pdf document. I'm using pdfplumber to extract specific text within a bounding box on each page. I've called this variable scene_text. The value of scene_text changes throughout the document, but many pages contain the same value for scene_text. I want to separate the large pdf into multiple smaller pdf files named according to their scene_text value with each pdf file containing all of the pages with matching scene_text. I'm terribly stuck, any help would be appreciated.
import pdfplumber
from PyPDF2 import PdfFileWriter, PdfFileReader
import os
file = 'report.pdf'
with pdfplumber.open(file) as pdf:
    for i, page in enumerate(pdf.pages):
        # get scene text for current page
        bounding_box = (880, 137, 1048, 180)
        scene_text = page.within_bbox(bounding_box, relative=True).extract_text()
        previous_page_text = pdf.pages[i-1].within_bbox(bounding_box, relative=True).extract_text()
        inputpdf = PdfFileReader(open(file, "rb"))
        output = PdfFileWriter()
        for x, page in enumerate(pdf.pages):
            st2 = page.within_bbox(bounding_box, relative=True).extract_text()
            if st2 != previous_page_text:
                output.addPage(inputpdf.getPage(i))
            if st2 == scene_text:
                if st2 == pdf.pages[x+1].within_bbox(bounding_box, relative=True).extract_text():
                    previous_page_text = st2
        with open("page_export/" + scene_text + ".pdf", "wb") as output_stream:
            output.write(output_stream)
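A simpler way to think about this might be two passes: first collect the scene_text of every page, then group page indices by that value and write one file per group. The sketch below assumes the same bounding box and the pre-3.0 PyPDF2 API from the question. The function names group_pages_by_label and split_by_scene and the "unlabeled" fallback are invented for this example, and it assumes every scene_text value is usable as a file name.

```python
import os


def group_pages_by_label(labels):
    """Map each label to the zero-based indices of the pages that carry it."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    return groups


def split_by_scene(source_path, out_dir, bounding_box=(880, 137, 1048, 180)):
    # Imported here so group_pages_by_label works without the PDF libraries.
    import pdfplumber
    from PyPDF2 import PdfFileReader, PdfFileWriter

    # Pass 1: read the label of every page once.
    with pdfplumber.open(source_path) as pdf:
        labels = [
            page.within_bbox(bounding_box, relative=True).extract_text()
            or "unlabeled"  # hypothetical fallback for pages without text
            for page in pdf.pages
        ]

    # Pass 2: one writer, and one output file, per distinct label.
    os.makedirs(out_dir, exist_ok=True)
    with open(source_path, "rb") as source:
        reader = PdfFileReader(source)
        for label, indices in group_pages_by_label(labels).items():
            writer = PdfFileWriter()
            for index in indices:
                writer.addPage(reader.getPage(index))
            with open(os.path.join(out_dir, label + ".pdf"), "wb") as out:
                writer.write(out)
```

Because grouping is by value and not by adjacency, pages with the same scene_text end up in the same file even when other pages sit between them.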

Editing a pdf page by page

I'm trying to make unique edits to individual pages in a pre-existing pdf. However, the edits remain the same.
I first tried FPDF (but wasn't sure how to edit a pre-existing pdf with it) and am now trying PyPDF2 with ReportLab.
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

def WriteOnPdf (targetpdf, pageTopicsDict):
    packet = io.BytesIO()
    # Create a new PDF with Reportlab
    can = canvas.Canvas(packet, pagesize=letter)
    can.setFont('Helvetica', 13)
    can.drawString(5, 730, pageTopicsDict[0])
    can.save()
    # Move to the beginning of the BytesIO buffer
    packet.seek(0)
    new_pdf = PdfFileReader(packet)
    # Read your existing PDF
    existing_pdf = PdfFileReader(open(targetpdf, "rb"))
    output = PdfFileWriter()
    # Add the "watermark" (which is the new pdf) on the existing page
    for i in range(existing_pdf.numPages):
        print(i, pageTopicsDict[i])
        can.drawString(5, 730, pageTopicsDict[i])
        page = existing_pdf.getPage(i)
        page.mergePage(new_pdf.getPage(0))# index out of range if not set to 0.
        output.addPage(page)
    # Finally, write "output" to a real file
    outputStream = open("destination.pdf", "wb")
    output.write(outputStream)
    outputStream.close()

dummyDict = {0: "abc", 1: "de, fg", 2: "hijklmn"}
WriteOnPdf ("test.pdf", dummyDict)
Expected: pdf with "abc" on top left hand corner of page 0, "de, fg" on page 1, "hijklmn" on page 2...
Actual: all pages have "abc"
Solved: I now initialize the packet and the related variables (the canvas and the overlay reader) inside the for loop instead of outside it, so each page gets its own overlay.
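The fix described above might look roughly like the following sketch, which assumes the same pre-3.0 PyPDF2 API and ReportLab calls as the question. The helper names label_for_page, make_overlay and write_on_pdf, and the empty-string default for pages without an entry, are choices made for this example only.

```python
import io


def label_for_page(page_topics, index, default=""):
    """Look up the text for a page; pages without an entry get the default."""
    return page_topics.get(index, default)


def make_overlay(text, x=5, y=730):
    """Render a single-page PDF that contains only the given text."""
    # Imported here so label_for_page works without ReportLab installed.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    packet = io.BytesIO()
    can = canvas.Canvas(packet, pagesize=letter)
    can.setFont("Helvetica", 13)
    can.drawString(x, y, text)
    can.save()  # writes the finished PDF into packet
    packet.seek(0)
    return packet


def write_on_pdf(target_pdf, page_topics, destination="destination.pdf"):
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(target_pdf, "rb") as source:
        existing_pdf = PdfFileReader(source)
        output = PdfFileWriter()
        for i in range(existing_pdf.numPages):
            # A fresh overlay is built for every page, inside the loop.
            overlay = PdfFileReader(make_overlay(label_for_page(page_topics, i)))
            page = existing_pdf.getPage(i)
            page.mergePage(overlay.getPage(0))
            output.addPage(page)
        with open(destination, "wb") as output_stream:
            output.write(output_stream)
```

In the original version the canvas was saved once before the loop, so the drawString calls inside the loop never reached the overlay that had already been read, which would explain why every page showed "abc".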

Batch generating barcodes using ReportLab

Yesterday, I asked a question that was perhaps too broad.
Today, I've acted on my ideas in an effort to implement a solution.
Using ReportLab, pdfquery and PyPDF2, I'm trying to automate the process of generating barcodes on hundreds of pages in a PDF document.
Each page needs to have one barcode. However, if a page has a letter in the top right ('A' through 'E') then it needs to use the same barcode as the previous page. The files with letters on the top right are duplicate forms with similar information.
If there is no letter present, then a unique barcode number (incremented by one is fine) should be used on that page.
My code seems to work, but I'm having two issues:
1. The barcode moves around ever so slightly (minor issue).
2. The barcode value will not change (major issue). Only the first barcode number is set on all pages.
I can't seem to tell why the value isn't changing. Does anyone have a clue?
Code is here:
import pdfquery
import os
from io import BytesIO
from PyPDF2 import PdfFileWriter, PdfFileReader
from reportlab.graphics.barcode import eanbc
from reportlab.graphics.shapes import Drawing
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import mm
from reportlab.pdfgen import canvas
from reportlab.graphics import renderPDF

pdf = pdfquery.PDFQuery("letters-test.pdf")
total_pages = pdf.doc.catalog['Pages'].resolve()['Count']
print("Total pages", total_pages)
barcode_value = 12345670
output = PdfFileWriter()
for i in range(0, total_pages):
    pdf.load(i) # Load page i into memory
    duplicate_letter = pdf.pq('LTTextLineHorizontal:in_bbox("432,720,612,820")').text()
    if duplicate_letter != '':
        print("Page " + str(i+1) + " letter " + str(duplicate_letter))
        print(barcode_value)
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=letter)
        # draw the eanbc8 code
        barcode_eanbc8 = eanbc.Ean8BarcodeWidget(str(barcode_value))
        bounds = barcode_eanbc8.getBounds()
        width = bounds[2] - bounds[0]
        height = bounds[3] - bounds[1]
        d = Drawing(50, 10)
        d.add(barcode_eanbc8)
        renderPDF.draw(d, c, 400, 700)
        c.save()
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # read existing PDF
        existing_pdf = PdfFileReader(open("letters-test.pdf", "rb"))
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(i)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
    else:
        # increment barcode value
        barcode_value += 1
        print("Page " + str(i+1) + " isn't a duplicate.")
        print(barcode_value)
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=letter)
        # draw the eanbc8 code
        barcode_eanbc8 = eanbc.Ean8BarcodeWidget(str(barcode_value))
        bounds = barcode_eanbc8.getBounds()
        width = bounds[2] - bounds[0]
        height = bounds[3] - bounds[1]
        d = Drawing(50, 10)
        d.add(barcode_eanbc8)
        renderPDF.draw(d, c, 420, 710)
        c.save()
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # read existing PDF
        existing_pdf = PdfFileReader(open("letters-test.pdf", "rb"))
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(i)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
    # Clear page i from memory and re load.
    # pdf = pdfquery.PDFQuery("letters-test.pdf")
outputStream = open("newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
And here is letters-test.pdf
As Kamil Nicki's answer points out, Ean8BarcodeWidget limits the effective digits to 7:
class Ean8BarcodeWidget(Ean13BarcodeWidget):
    _digits=7
    ...
    self.value=max(self._digits-len(value),0)*'0'+value[:self._digits]
You may change your encoding scheme, or use an EAN-13 barcode with Ean13BarcodeWidget, which has 12 usable digits.
The reason your barcode is not changing is that you passed too long an integer to eanbc.Ean8BarcodeWidget.
According to the EAN standard, EAN-8 barcodes are 8 digits long (7 data digits plus a check digit).
Solution:
If you change barcode_value from 12345670 to 1234560 and run your script, you will see that the barcode value increases as you want and that the check digit is appended as the eighth digit.
With that in mind, you should use only 7 digits to encode information in the barcode.
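To see the effect without generating any PDFs, the padding and truncation from the quoted self.value line can be reproduced in plain Python. This is only an illustration of that one expression, not ReportLab's code: ean8_payload and EAN8_DATA_DIGITS are names made up for this sketch.

```python
EAN8_DATA_DIGITS = 7  # mirrors _digits = 7 in the excerpt above


def ean8_payload(value, digits=EAN8_DATA_DIGITS):
    """Pad or truncate a value the same way the quoted expression does."""
    text = str(value)
    return max(digits - len(text), 0) * "0" + text[:digits]


# Eight-digit inputs differ only in the last digit, which gets cut off.
eight_digit = [ean8_payload(n) for n in (12345670, 12345671, 12345672)]
# Seven-digit inputs survive intact, so incrementing changes the payload.
seven_digit = [ean8_payload(n) for n in (1234560, 1234561, 1234562)]

print(eight_digit)  # ['1234567', '1234567', '1234567']
print(seven_digit)  # ['1234560', '1234561', '1234562']
```

Incrementing an eight-digit value only changes the digit that is discarded, which matches the behaviour reported in the question.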

How to "write to variable" instead of "to file" in Python

I'm trying to write a function which splits a pdf into separate pages. From this SO answer I copied a simple function which does that:
def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        with open("document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
    return pages
This however, writes the new PDFs to file, instead of returning a list of the new PDFs as file variables. So I changed the line of output.write(outputStream) to:
pages.append(outputStream)
When trying to write the elements in the pages list however, I get a ValueError: I/O operation on closed file.
Does anybody know how I can add the new files to the list and return them, instead of writing them to file? All tips are welcome!
It is not completely clear what you mean by "list of PDFs as file variables". If you want to create strings instead of files with the PDF contents, and return a list of such strings, replace open() with StringIO and call getvalue() to obtain the contents. Note that cStringIO exists only in Python 2; on Python 3, use io.BytesIO instead.
import cStringIO
def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        io = cStringIO.StringIO()
        output.write(io)
        pages.append(io.getvalue())
    return pages
You can use the in-memory binary streams in the io module. This will store the pdf files in your memory.
import io
def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        outputStream = io.BytesIO()
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        output.write(outputStream)
        # Move the stream position to the beginning,
        # making it easier for other code to read
        outputStream.seek(0)
        pages.append(outputStream)
    return pages
To later write the objects to a file, use shutil.copyfileobj:
import shutil
with open('page0.pdf', 'wb') as out:
    shutil.copyfileobj(pages[0], out)
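For what it's worth, the ValueError in the question and the BytesIO alternative can both be reproduced with the standard library alone. In the sketch below, plain byte strings stand in for the PDF data, and the function names collect_closed_handles and collect_in_memory are made up for the demonstration.

```python
import io
import shutil
import tempfile
from pathlib import Path


def collect_closed_handles(directory, payloads):
    """Mimic the failing attempt: keep handles opened inside a with block."""
    handles = []
    for i, _ in enumerate(payloads):
        with open(Path(directory) / f"page{i}.bin", "wb") as stream:
            handles.append(stream)  # closed as soon as the with block exits
    return handles


def collect_in_memory(payloads):
    """Keep each payload in its own BytesIO, rewound for later reading."""
    streams = []
    for data in payloads:
        stream = io.BytesIO()
        stream.write(data)
        stream.seek(0)
        streams.append(stream)
    return streams


payloads = [b"first page bytes", b"second page bytes"]  # stand-ins for PDF data

with tempfile.TemporaryDirectory() as tmp:
    closed = collect_closed_handles(tmp, payloads)
    try:
        closed[0].write(b"more")
        error = None
    except ValueError as exc:  # the handle is already closed
        error = exc

    # The in-memory streams stay open, so they can be copied to disk later.
    pages = collect_in_memory(payloads)
    target = Path(tmp) / "page0.bin"
    with open(target, "wb") as out:
        shutil.copyfileobj(pages[0], out)
    copied = target.read_bytes()
```

The handles appended inside the with block are closed by the time the list is used, whereas a BytesIO object stays usable until it is explicitly closed or garbage collected.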
I haven't used PdfFileWriter, but I think this should work: keep one writer per page in the list, and write them out later.
def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        pages.append(output)
    return pages

def writePdf(pages):
    i = 1
    for p in pages:
        with open("document-page%s.pdf" % i, "wb") as outputStream:
            p.write(outputStream)
        i += 1
