Python PyPDF2: crop PDF pages and merge into a single page - python

I'm trying to crop several PDF pages and merge them into a single roll-style page.
I need to remove the header and the footer of each page and stack the cropped pages into one long page.
import PyPDF2
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import PageObject

reader = PdfFileReader('/Users/kic/Desktop/test.pdf','r')
writer = PdfFileWriter()

### find total height of file ###
numpages = reader.getNumPages() ## get number of pages
Height = reader.getPage(0).mediaBox.getHeight() ## get height of title page
Height = Height + 482 * (reader.getNumPages()-2) ## add the height of the cropped pages

### create new single roll page ###
Single_page = PageObject.createBlankPage(None, reader.getPage(0).mediaBox.getWidth(), Height)

### add first title page without cropping ###
Single_page.mergeTranslatedPage(reader.getPage(0),0,Height-reader.getPage(0).mediaBox.getHeight(),False)

### loop through all pages from page 2 until last page ###
n=1
for i in range(reader.getNumPages()-1):
    i=n
    page = reader.getPage(i)
    page.cropBox.setUpperLeft((0,556))
    page.cropBox.setUpperRight((page.mediaBox.getWidth(),556))
    page.cropBox.setLowerLeft((0,74))
    page.cropBox.setLowerRight((page.mediaBox.getWidth(),74))
    Single_page.mergeTranslatedPage(page,0,482*(numpages-1-n),False)
    #writer.addPage(page) ##to see the result of the cropped pages without merging
    n = n+1

writer.addPage(Single_page)
output = open('/Users/kic/Desktop/testrcrop.pdf','wb')
writer.write(output)
output.close()
For some reason it is not cropping. It merges the pages into one page, but the headers and footers overlap each other.
However, if I don't merge into one page and just write the cropped pages to a PDF file with several pages, they show up cropped.

You should translate each cropped area and set its mediabox to the bounds it will occupy on the new page.
The translation is relative to the current position, whereas the crop bounds are absolute values.
According to the mergeTranslatedPage() documentation, it has been deprecated, and you should use add_transformation() and merge_page() instead.
The line y_translation = total_height - upper_bound calculates how far each page must be translated, given the bounds of its cropped area and the space already taken by the pages previously placed on the new page.
For example:
You have 4 pages with a height of 800, and pages 2 to 4 are cropped between y = 600 and y = 200, so your new page should have a height of 2000 (800 + 3 * 400).
The top of your first page is at 800, so you translate it by 1200 and crop it between 2000 and 1200. The top of the second page's cropped area is at 600 and must end up at 1200 (2000 - 800), so you translate it by 600 and crop it between 1200 and 800. And so on, giving these translations (T), upper bounds (U) and bottom bounds (B):
T=1200, U=2000, B=1200
T=600, U=1200, B=800
T=200, U=800, B=400
T=-200, U=400, B=0
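The arithmetic above can be checked without any PDF library. The following is a minimal sketch, not part of the original answer; the function name stack_layout and its parameters are made up for illustration. It walks the pages from top to bottom and tracks how much vertical space is still free on the new page.

```python
def stack_layout(first_height, crop_top, crop_bottom, page_count):
    """Return (translation, upper, bottom) for each page, stacked top to bottom.

    Hypothetical helper: the first page is kept whole, every later page is
    cropped to the band between crop_bottom and crop_top.
    """
    crop_height = crop_top - crop_bottom
    # Space still free on the new page, measured from its bottom edge.
    remaining = first_height + crop_height * (page_count - 1)
    layout = []
    for index in range(page_count):
        upper = first_height if index == 0 else crop_top
        lower = 0 if index == 0 else crop_bottom
        # Move the top of the kept area up (or down) to the top of the free space.
        translation = remaining - upper
        layout.append((translation, upper + translation, lower + translation))
        # Only the kept band consumes space on the new page.
        remaining -= upper - lower
    return layout


layout = stack_layout(first_height=800, crop_top=600, crop_bottom=200, page_count=4)
for t, u, b in layout:
    print(f"T={t}, U={u}, B={b}")
```

Run with the numbers from the example, this prints the same four T/U/B lines listed above.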
from PyPDF2 import PageObject, PdfFileReader, PdfFileWriter, Transformation

# Define the crop bounds for pages other than the first page
CROP_Y_TOP = 556
CROP_Y_HEIGHT = 482

# Set the input and output file paths
input_path = r"input.pdf"
output_path = r"output.pdf"

# Open the input and output files in binary read and write mode
with open(input_path, "rb") as input_file, open(output_path, "wb") as output_file:
    # Create a PdfFileReader object for the input file
    reader = PdfFileReader(input_file)
    # Create a PdfFileWriter object for the output file
    writer = PdfFileWriter()

    # Calculate the total height of the output page
    total_height = reader.getPage(0).mediabox.height + (CROP_Y_HEIGHT * (reader.getNumPages() - 1))

    # Create a blank page with the calculated total height
    single_page = PageObject.create_blank_page(
        pdf=None,
        width=reader.getPage(0).mediabox.width,
        height=total_height
    )

    # Loop through all pages of the input document
    for i in range(reader.getNumPages()):
        # Get the current page
        page = reader.getPage(i)
        original_mediabox = page.mediabox

        # Determine the upper and lower bounds for the crop
        upper_bound = original_mediabox.height if i == 0 else CROP_Y_TOP
        lower_bound = 0 if i == 0 else CROP_Y_TOP - CROP_Y_HEIGHT

        # Calculate the translation to apply to the page
        y_translation = total_height - upper_bound

        # Create a transformation object with the calculated translation
        transformation = Transformation().translate(ty=y_translation)
        # Apply the transformation to the page
        page.add_transformation(transformation)

        # Update the page media box with the new bounds
        page.mediabox.lower_left = (0, lower_bound + y_translation)
        page.mediabox.upper_right = (original_mediabox.width, upper_bound + y_translation)
        print(f"T={y_translation}\tU={upper_bound + y_translation}\tL={lower_bound + y_translation}")

        # Merge the transformed page onto the output page
        single_page.merge_page(page)

        # Decrease the remaining height by the height of the area just placed
        # (upper_bound - lower_bound), so the next area lands directly below it
        total_height -= upper_bound - lower_bound

    # Add the output page to the writer
    writer.addPage(single_page)
    # Write the output file
    writer.write(output_file)

Related

Image Extraction using python

I'm using fitz to extract images from a PDF file, but the output images aren't in the same order as in the PDF. Is there any method to extract these images in order?
Note: every page contains 2 images, and I want to extract the left one first, then the right one.
for page_index in range(len(pdf_file)):
    if page_index < 4 or page_index == len(pdf_file) - 1:
        continue
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    v = 0
    for image_index, img in enumerate(page.getImageList(), start=1):
        if v == 0:
            v += 1
            continue
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk

crop a pdf with PyPDF2

I've been working on a project in which I extract table data from a PDF with a neural network.
I successfully detect tables and get their coordinates (x, y, width, height). I've been trying to crop the PDF with PyPDF2 to isolate the table, but for some reason the crop never matches the desired outcome.
After running inference I get these coordinates:
[[5.0948269e+01, 1.5970685e+02, 1.1579385e+03, 2.7092386e+02,
9.9353129e-01]]
The fifth number is my neural network's precision; we can safely ignore it.
Trying them in pyplot works, so there's no problem with them:
However, using the same coordinates in PyPDF2 is always off.
from PyPDF2 import PdfFileWriter, PdfFileReader

with open("mypdf.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()
    numPages = input1.getNumPages()
    for i in range(numPages):
        page = input1.getPage(i)
        page.cropBox.upperLeft = (5.0948269e+01,1.5970685e+02)
        page.cropBox.upperLeft = (1.1579385e+03, 2.7092386e+02)
        output.addPage(page)
    with open("out.pdf", "wb") as out_f:
        output.write(out_f)
This is the output I get:
Am I missing something?
Thank you!
Here you go:
from PyPDF2 import PdfFileWriter, PdfFileReader

with open("mypdf.pdf", "rb") as in_f:
    input1 = PdfFileReader(in_f)
    output = PdfFileWriter()
    numPages = input1.getNumPages()

    x, y, w, h = (5.0948269e+01, 1.5970685e+02, 1.1579385e+03, 2.7092386e+02)
    page_x, page_y = input1.getPage(0).cropBox.getUpperLeft()
    upperLeft = [page_x.as_numeric(), page_y.as_numeric()] # convert PyPDF2.FloatObjects into floats
    new_upperLeft = (upperLeft[0] + x, upperLeft[1] - y)
    new_lowerRight = (new_upperLeft[0] + w, new_upperLeft[1] - h)

    for i in range(numPages):
        page = input1.getPage(i)
        page.cropBox.upperLeft = new_upperLeft
        page.cropBox.lowerRight = new_lowerRight
        output.addPage(page)

    with open("out.pdf", "wb") as out_f:
        output.write(out_f)
Note: in PyPDF2 the coordinate origin is the lower-left corner of the page, and the Y-axis points from the bottom up, unlike on the screen. So to get the PDF y-coordinate of the top edge of your crop area, subtract the y-coordinate of that edge (measured from the top of the page) from the height of the page.
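The conversion described in the note can be sketched as a small library-free function. This is only an illustration: the name top_left_box_to_pdf is made up, the sample numbers are arbitrary, and it assumes the page's lower-left corner sits at (0, 0) and that the box is already expressed in PDF points.

```python
def top_left_box_to_pdf(x, y, width, height, page_height):
    """Convert a box measured from the top-left corner (y grows downward)
    into PDF corners measured from the bottom-left corner (y grows upward).

    Hypothetical helper; assumes the page origin is (0, 0).
    """
    upper_left = (x, page_height - y)
    lower_right = (x + width, page_height - y - height)
    return upper_left, lower_right


# Arbitrary sample box on a 792-point-high page.
upper_left, lower_right = top_left_box_to_pdf(50, 160, 400, 100, page_height=792)
print(upper_left, lower_right)
```

The two returned corners are the values one would assign to cropBox.upperLeft and cropBox.lowerRight, as the answer's code does.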

Python WAND resize poor quality

I have the following code to convert a PDF to JPG. I had to resize the image because of its large size, and in doing so I lose the original PDF format (A4, A3, etc.):
with Img(filename=pdfName, resolution=self.resolution) as document:
    reader = PyPDF2.PdfFileReader(pdfName.replace('[0]', ''))
    for page_number, page in enumerate(document.sequence):
        pdfSize = reader.getPage(page_number).mediaBox
        width = pdfSize[2]
        height = pdfSize[3]
        with Img(page, resolution=self.resolution) as img:
            # Do not resize the first page, which is used to find useful information
            if not get_first_page:
                img.resize(int(width), int(height))
            img.compression_quality = self.compressionQuality
            img.background_color = Color("white")
            img.alpha_channel = 'remove'
            if get_first_page:
                filename = output
            else:
                filename = tmpPath + '/' + 'tmp-' + str(page_number) + '.jpg'
            img.save(filename=filename)
So, for each page, I read the PDF size and resize the output made with Wand. But my problem is the quality of the JPG, which is really poor.
My resolution is 300 (I tried higher values, without success) and compressionQuality is 100.
Any ideas?
Thanks

Batch generating barcodes using ReportLab

Yesterday, I asked a question that was perhaps too broad.
Today, I've acted on my ideas in an effort to implement a solution.
Using ReportLab, pdfquery and PyPDF2, I'm trying to automate the process of generating barcodes on hundreds of pages in a PDF document.
Each page needs to have one barcode. However, if a page has a letter in the top right ('A' through 'E') then it needs to use the same barcode as the previous page. The files with letters on the top right are duplicate forms with similar information.
If there is no letter present, then a unique barcode number (incremented by one is fine) should be used on that page.
My code seems to work, but I'm having two issues:
1. The barcode moves around ever so slightly (minor issue).
2. The barcode value will not change (major issue). Only the first barcode number is set on all pages.
I can't seem to tell why the value isn't changing. Does anyone have a clue?
Code is here:
import pdfquery
import os
from io import BytesIO
from PyPDF2 import PdfFileWriter, PdfFileReader
from reportlab.graphics.barcode import eanbc
from reportlab.graphics.shapes import Drawing
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import mm
from reportlab.pdfgen import canvas
from reportlab.graphics import renderPDF
pdf = pdfquery.PDFQuery("letters-test.pdf")
total_pages = pdf.doc.catalog['Pages'].resolve()['Count']
print("Total pages", total_pages)
barcode_value = 12345670
output = PdfFileWriter()
for i in range(0, total_pages):
    pdf.load(i) # Load page i into memory
    duplicate_letter = pdf.pq('LTTextLineHorizontal:in_bbox("432,720,612,820")').text()
    if duplicate_letter != '':
        print("Page " + str(i+1) + " letter " + str(duplicate_letter))
        print(barcode_value)
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=letter)
        # draw the eanbc8 code
        barcode_eanbc8 = eanbc.Ean8BarcodeWidget(str(barcode_value))
        bounds = barcode_eanbc8.getBounds()
        width = bounds[2] - bounds[0]
        height = bounds[3] - bounds[1]
        d = Drawing(50, 10)
        d.add(barcode_eanbc8)
        renderPDF.draw(d, c, 400, 700)
        c.save()
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # read existing PDF
        existing_pdf = PdfFileReader(open("letters-test.pdf", "rb"))
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(i)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
    else:
        # increment barcode value
        barcode_value += 1
        print("Page " + str(i+1) + " isn't a duplicate.")
        print(barcode_value)
        packet = BytesIO()
        c = canvas.Canvas(packet, pagesize=letter)
        # draw the eanbc8 code
        barcode_eanbc8 = eanbc.Ean8BarcodeWidget(str(barcode_value))
        bounds = barcode_eanbc8.getBounds()
        width = bounds[2] - bounds[0]
        height = bounds[3] - bounds[1]
        d = Drawing(50, 10)
        d.add(barcode_eanbc8)
        renderPDF.draw(d, c, 420, 710)
        c.save()
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # read existing PDF
        existing_pdf = PdfFileReader(open("letters-test.pdf", "rb"))
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(i)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
    # Clear page i from memory and re load.
    # pdf = pdfquery.PDFQuery("letters-test.pdf")

outputStream = open("newpdf.pdf", "wb")
output.write(outputStream)
outputStream.close()
And here is letters-test.pdf
As Kamil Nicki's answer pointed out, Ean8BarcodeWidget limits the effective digits to 7:
class Ean8BarcodeWidget(Ean13BarcodeWidget):
    _digits=7
    ...
        self.value=max(self._digits-len(value),0)*'0'+value[:self._digits]
You may change your encoding scheme or use an EAN-13 barcode with Ean13BarcodeWidget, which has 12 usable digits.
The reason your barcode is not changing is that you passed too long an integer to eanbc.Ean8BarcodeWidget.
According to the EAN standard, EAN-8 barcodes are 8 digits long (7 data digits + 1 check digit).
Solution:
If you change barcode_value from 12345670 to 1234560 and run your script, you will see that the barcode value increases as you want and the check digit is appended as the eighth digit.
With that in mind, you should use only 7 digits to encode information in the barcode.
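To see why the eight-digit start value never changes, the padding and truncation rule can be reproduced without ReportLab. In this sketch the expression in ean8_payload is copied from the widget snippet quoted above, while the function names and the check-digit helper are illustrative additions; the helper applies the standard EAN-8 rule (weights 3, 1, 3, 1, 3, 1, 3 from the left).

```python
def ean8_payload(value, digits=7):
    """Pad with leading zeros or truncate to `digits` characters,
    mirroring the expression quoted from Ean8BarcodeWidget."""
    text = str(value)
    return max(digits - len(text), 0) * "0" + text[:digits]


def ean8_check_digit(payload):
    """Standard EAN-8 check digit for a 7-digit payload (hypothetical helper)."""
    weights = (3, 1, 3, 1, 3, 1, 3)
    total = sum(int(digit) * weight for digit, weight in zip(payload, weights))
    return (10 - total % 10) % 10


for value in (12345670, 12345671, 1234560, 1234561):
    payload = ean8_payload(value)
    print(value, "->", payload + str(ean8_check_digit(payload)))
```

Both eight-digit inputs collapse to the payload 1234567, so incrementing the last digit has no visible effect, whereas the seven-digit inputs keep their last digit and produce different barcodes.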

PyPDF2 & ReportLab editing a PDF and merging multiple pages

I'm trying to add some text (page numbers) to an existing PDF file.
I'm using the PyPDF2 package to iterate through the original file, create a canvas for each page, then merge the two. My problem is that once the program is finished, the new PDF file only has the last page from the original PDF, not all the pages.
E.g. if the original PDF has 33 pages, the new PDF only has the last page, but with the correct numbering.
Maybe the code can do a better job at explaining:
def test(location, reference, destination):
    file = open(location, "rb")
    read_pdf = PyPDF2.PdfFileReader(file)
    for i in range (0, read_pdf.getNumPages()):
        page = read_pdf.getPage(i)
        pageReference = "%s_%s"%(reference,format(i+1, '03d'))
        width = getPageSizeW(page)
        height = getPageSizeH(page)
        pagesize = (width, height)
        packet = io.BytesIO()
        can = canvas.Canvas(packet, pagesize = pagesize)
        can.setFillColorRGB(1,0,0)
        can.drawString(height*3.5, height*2.75, pageReference)
        can.save()
        packet.seek(0)
        new_pdf = PyPDF2.PdfFileReader(packet)
        #add new pdf to old pdf
        output = PyPDF2.PdfFileWriter()
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
        outputStream = open(destination, 'wb')
        output.write(outputStream)
        print(pageReference)
    outputStream.close()
    file.close()

def getPageSizeH(p):
    h = float(p.mediaBox.getHeight()) * 0.352
    return h

def getPageSizeW(p):
    w = float(p.mediaBox.getWidth()) * 0.352
    return w
Also, if anyone has any ideas on how to insert the references on the top right in a better way, it would be appreciated.
I'm not an expert at PyPDF2, but the only place in your function where you call PyPDF2.PdfFileWriter() is inside your for loop. I suspect you are creating a new writer and adding a single page to it on every iteration, which would produce the result you see: only the last page survives.
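The effect of where the writer is created can be shown without PyPDF2. This is a library-free sketch: ListWriter is a hypothetical stand-in for PyPDF2.PdfFileWriter that only collects pages, and the page strings are placeholders for real page objects.

```python
class ListWriter:
    """Hypothetical stand-in for PyPDF2.PdfFileWriter; it only collects pages."""

    def __init__(self):
        self.pages = []

    def add_page(self, page):
        self.pages.append(page)


def writer_inside_loop(pages):
    # Mirrors the question: a fresh writer on every iteration,
    # so each one only ever holds the current page.
    for page in pages:
        writer = ListWriter()
        writer.add_page(page)
    return writer


def writer_before_loop(pages):
    # Suggested change: one writer created once, filled by the loop.
    writer = ListWriter()
    for page in pages:
        writer.add_page(page)
    return writer


pages = [f"page {n}" for n in range(1, 34)]  # 33 placeholder pages
broken = writer_inside_loop(pages)
fixed = writer_before_loop(pages)
print(len(broken.pages), len(fixed.pages))
```

Applied to the question's function, this would mean creating the PdfFileWriter before the for loop and opening and writing the destination file once, after the loop has added every page.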
