Separating large PDF document into smaller documents based on content

I have a large PDF file with very specific formatting: a collection of reports combined into one document. I'm using pdfplumber to extract the text inside a fixed bounding box on each page, and I store it in a variable called scene_text. The value of scene_text changes throughout the document, but many pages share the same value. I want to split the large PDF into multiple smaller PDF files, each named after a scene_text value and containing all of the pages with that value. I'm terribly stuck; any help would be appreciated.
import pdfplumber
from PyPDF2 import PdfFileWriter, PdfFileReader
import os
file = 'report.pdf'
with pdfplumber.open(file) as pdf:
    for i, page in enumerate(pdf.pages):
        # get scene text for current page
        bounding_box = (880, 137, 1048, 180)
        scene_text = page.within_bbox(bounding_box, relative=True).extract_text()
        previous_page_text = pdf.pages[i-1].within_bbox(bounding_box, relative=True).extract_text()
        inputpdf = PdfFileReader(open(file, "rb"))
        output = PdfFileWriter()
        for x, page in enumerate(pdf.pages):
            st2 = page.within_bbox(bounding_box, relative=True).extract_text()
            if st2 != previous_page_text:
                output.addPage(inputpdf.getPage(i))
            if st2 == scene_text:
                if st2 == pdf.pages[x+1].within_bbox(bounding_box, relative=True).extract_text():
                    previous_page_text = st2
        with open("page_export/" + scene_text + ".pdf", "wb") as output_stream:
            output.write(output_stream)
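A possible way to organise the split, sketched without any PDF library: collect the scene_text of every page once, group the page indices by that value, and only then write one output file per group. The helper names `group_pages_by_label` and `safe_filename` are hypothetical and not part of pdfplumber or PyPDF2. The sketch also assumes that pages with no extractable text should share a group called "untitled".

```python
import re
from collections import OrderedDict


def group_pages_by_label(labels):
    """Map each label to the list of page indices that carry it.

    `labels` holds one string (or None) per page, in page order.
    """
    groups = OrderedDict()
    for index, label in enumerate(labels):
        # Assumption: pages without extractable text share one group.
        key = (label or "").strip() or "untitled"
        groups.setdefault(key, []).append(index)
    return groups


def safe_filename(label):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r"[^\w\- ]+", "_", label).strip()
    return cleaned or "untitled"


# Stand-in for the per-page scene_text values extracted with pdfplumber.
labels = ["Scene 1", "Scene 1", "Scene 2/A", None, "Scene 1"]
groups = group_pages_by_label(labels)
for label, pages in groups.items():
    print(safe_filename(label) + ".pdf", pages)
```

Each group could then be written with its own PdfFileWriter: call addPage(inputpdf.getPage(i)) for every index in the group and save the result under page_export/ using the sanitized name.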

Related

Filtering the extract of a pdf

I am currently extracting data from a pdf.
My code is able to extract each line of that PDF.
What I want is:
to extract specific fields, e.g. 'Téléc', 'Courriel', 'Territoire desservi'
to get the value that goes with each field, e.g. Téléc = 579-240-XXX, and so on.
Here is my code. How can I filter the extracted text?
Code:
import PyPDF2
pdfileObj = xxxxxxxxxxxx
pdf = open(pdfileObj, 'rb')
pdfreader = PyPDF2.PdfFileReader(pdfileObj)
pageObj = pdfreader.getPage(10)
text = pageObj.extract_text()
for words in text:
    search_keywords = ['Téléc', 'Courriel', 'Territoire desservi']
print(text)
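One way to filter the extracted text is to scan it line by line and keep only the lines that start with one of the keywords. The sketch below assumes that each field sits on its own line, with the value following the keyword and usually a colon. The helper name `extract_fields` and the sample text are made up for illustration.

```python
def extract_fields(text, keywords):
    """Return {keyword: value} for lines that start with a keyword."""
    fields = {}
    for line in text.splitlines():
        stripped = line.strip()
        for keyword in keywords:
            if stripped.startswith(keyword):
                # Keep what follows the first colon, or everything after
                # the keyword when the line has no colon at all.
                _, sep, after = stripped.partition(":")
                value = after if sep else stripped[len(keyword):]
                fields[keyword] = value.strip()
                break
    return fields


search_keywords = ['Téléc', 'Courriel', 'Territoire desservi']
# Made-up sample standing in for the text returned by extract_text().
sample_text = (
    "Adresse : 1 rue Exemple\n"
    "Téléc : 579-240-XXX\n"
    "Courriel: info@example.com\n"
    "Territoire desservi : Exemple\n"
)
for keyword, value in extract_fields(sample_text, search_keywords).items():
    print(keyword, "=", value)
```

With the question's code, the `text` returned by `pageObj.extract_text()` would take the place of `sample_text`.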

Python PyMuPDF looping next pages

I'm using the code below to open a PDF file and convert its first page into an image file. Now I'm trying to figure out how to loop over the following pages and convert them the same way. Any help is much appreciated!
# display image on the canvas
def openFile(self, _value=False):
    global fileImg, output
    path = os.path.dirname(ustr(self.filePath)) if self.filePath else '.'
    fileImg = QFileDialog.getOpenFileName(self, '%s - Choose file' % __appname__, path)
    # convert PDF to image file
    pdffile = fileImg
    doc = fitz.open(pdffile)
    page = doc.loadPage(0)
    pix = page.getPixmap(matrix=fitz.Matrix(100 / 72, 100 / 72))
    output = "output.png"
    pix.writePNG(output)

You can simply loop over the doc object to get the next pages.
doc = fitz.open(file_name)  # open document
for page in doc:  # iterate through the pages
    pix = page.getPixmap(...)  # render page to an image
    pix.writePNG("page-%i.png" % page.number)  # store image as a PNG
Check the PyMuPDF documentation for more information.

You can also use minecart. This snippet saves the first image embedded in each page as a JPEG:
import minecart
from PIL import Image
file = open('Yourdoc.pdf', 'rb')
doc = minecart.Document(file)
page = doc.iter_pages()
pageref = []
for j, i in enumerate(page):
    # take the first embedded image of the page as a PIL image
    im = i.images[0].as_pil()
    im.save(f"folderlocation/{j}.jpg")

Merging PDFs while retaining custom page numbers (aka pagelabels) and bookmarks

I'm trying to automate merging several PDF files and have two requirements: a) existing bookmarks AND b) pagelabels (custom page numbering) need to be retained.
Retaining bookmarks when merging happens by default with PyPDF2 and pdftk, but not with pdfrw.
Pagelabels are consistently not retained in PyPDF2, pdftk or pdfrw.
I am guessing, after having searched a lot, that there is no straightforward approach to doing what I want. If I'm wrong then I hope someone can point to this easy solution. But, if there is no easy solution, any tips on how to get this going in python will be much appreciated!
Some example code:
1) With PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
tmp1 = PdfFileReader('file1.pdf', 'rb')
tmp2 = PdfFileReader('file2.pdf', 'rb')
#extracting pagelabels is easy
pl1 = tmp1.trailer['/Root']['/PageLabels']
pl2 = tmp2.trailer['/Root']['/PageLabels']
# but, from what I understand, PdfFileWriter and PdfFileMerger do not support writing them
So I don't know how to proceed from here.
2) With pdfrw (has more promise)
from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
#read 1st file
tmp1 = PdfReader('file1')
#add the pages
writer.addpages(tmp1.pages)
#copy bookmarks to writer
writer.trailer.Root.Outlines = tmp1.Root.Outlines
#copy pagelabels to writer
writer.trailer.Root.PageLabels = tmp1.Root.PageLabels
#read second file
tmp2 = PdfReader('file2')
#append pages
writer.addpages(tmp2.pages)
# so far so good
The page numbers of the bookmarks from the 2nd file need to be offset before adding them, but when reading the outlines I almost always get (IndirectObject, XXX) instead of page numbers. It's unclear how to get the page number for each label and bookmark using pdfrw. So I'm stuck again.

As mentioned in my comment, I'm posting a generic solution for merging several PDFs that works in PyPDF2. I don't know what makes this work in PyPDF2, other than initializing pls as an ArrayObject().
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
import PyPDF2.pdf as PDF
# pls holds all the pagelabels as we iterate through multiple pdfs
pls = PDF.ArrayObject()
# used to offset bookmarks
pageCount = 0
cpdf = PdfFileMerger()
# pdffiles is a list of all files to be merged
for i in range(len(pdffiles)):
    tmppdf = PdfFileReader(pdffiles[i], 'rb')
    cpdf.append(tmppdf)
    # copy all the pagelabels which I assume is present in all files
    # you could use 'try' in case no pagelabels are present
    plstmp = tmppdf.trailer['/Root']['/PageLabels']['/Nums']
    # sometimes keys are indirect objects
    # so, iterate through each pagelabel and...
    for j in range(len(plstmp)):
        # ... get the actual values
        plstmp[j] = plstmp[j].getObject()
        # offset pagenumbers by current count of pages
        if isinstance(plstmp[j], int):
            plstmp[j] = PDF.NumberObject(plstmp[j] + pageCount)
    # once all the pagelabels are processed I append to pls
    pls += plstmp
    # increment pageCount
    pageCount += tmppdf.getNumPages()
# rest follows KevinM's answer
pagenums = PDF.DictionaryObject()
pagenums.update({PDF.NameObject('/Nums'): pls})
pagelabels = PDF.DictionaryObject()
pagelabels.update({PDF.NameObject('/PageLabels'): pagenums})
cpdf.output._root_object.update(pagelabels)
cpdf.write("filename.pdf")

You need to iterate through the existing PageLabels and add them to the merged output, taking care to add an offset to the page index entry, based on the number of pages already added.
This solution also requires PyPDF4, since PyPDF2 produces a weird error (see bottom).
from PyPDF4 import PdfFileWriter, PdfFileMerger, PdfFileReader
# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF
import logging
def add_nums(num_entry, page_offset, nums_array):
    for num in num_entry['/Nums']:
        if isinstance(num, (int)):
            logging.debug("Found page number %s, offset %s: ", num, page_offset)
            # Add the physical page information
            nums_array.append(PDF.NumberObject(num+page_offset))
        else:
            # {'/S': '/r'}, or {'/S': '/D', '/St': 489}
            keys = num.keys()
            logging.debug("Found page label, keys: %s", keys)
            number_type = PDF.DictionaryObject()
            # Always copy the /S entry
            s_entry = num['/S']
            number_type.update({PDF.NameObject("/S"): PDF.NameObject(s_entry)})
            logging.debug("Adding /S entry: %s", s_entry)
            if '/St' in keys:
                # If there is an /St entry, fetch it
                pdf_label_offset = num['/St']
                # and copy it unchanged (it is the label's start value, not a page index)
                logging.debug("Found /St %s", pdf_label_offset)
                number_type.update({PDF.NameObject("/St"): PDF.NumberObject(pdf_label_offset)})
            # Add the label information
            nums_array.append(number_type)
    return nums_array
def write_merged(pdf_readers):
    # Output
    merger = PdfFileMerger()
    # For PageLabels information
    page_labels = []
    page_offset = 0
    nums_array = PDF.ArrayObject()
    # Iterate through all the inputs
    for pdf_reader in pdf_readers:
        try:
            # Merge the content
            merger.append(pdf_reader)
            # Handle the PageLabels
            # Fetch page information
            old_page_labels = pdf_reader.trailer['/Root']['/PageLabels']
            page_count = pdf_reader.getNumPages()
            # Add PageLabel information
            add_nums(old_page_labels, page_offset, nums_array)
            page_offset = page_offset + page_count
        except Exception as err:
            print("ERROR: %s" % err)
    # Add PageLabels
    page_numbers = PDF.DictionaryObject()
    page_numbers.update({PDF.NameObject("/Nums"): nums_array})
    page_labels = PDF.DictionaryObject()
    page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})
    root_obj = merger.output._root_object
    root_obj.update(page_labels)
    # Write output
    merger.write('merged.pdf')
pdf_readers = []
tmp1 = PdfFileReader('file1.pdf', 'rb')
tmp2 = PdfFileReader('file2.pdf', 'rb')
pdf_readers.append(tmp1)
pdf_readers.append(tmp2)
write_merged(pdf_readers)
Note: PyPDF2 produces this weird error:
...
...
  File "/usr/lib/python3/dist-packages/PyPDF2/pdf.py", line 552, in _sweepIndirectReferences
    data[key] = value
  File "/usr/lib/python3/dist-packages/PyPDF2/generic.py", line 507, in __setitem__
    raise ValueError("key must be PdfObject")
ValueError: key must be PdfObject
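The offset step that both answers rely on can be illustrated without a PDF library. In a /PageLabels number tree, /Nums is a flat array that alternates a zero-based page index with a label dictionary. When documents are concatenated, only the page indices need shifting, while entries such as /St (the first label number of a range) stay as they are. The sketch below models /Nums as plain Python lists; `merge_page_label_nums` is a hypothetical name and the two sample documents are made up.

```python
def merge_page_label_nums(documents):
    """Concatenate /Nums arrays, shifting page indices by the pages already merged.

    `documents` is a list of (nums, page_count) pairs, where `nums`
    alternates a zero-based page index with a label dictionary.
    """
    merged = []
    page_offset = 0
    for nums, page_count in documents:
        # Walk the array two items at a time: (page index, label dict).
        for start_index, label in zip(nums[0::2], nums[1::2]):
            merged.append(start_index + page_offset)
            merged.append(dict(label))  # copied unchanged, including /St
        page_offset += page_count
    return merged


# Document 1: 4 pages, roman labels from page 0, decimal labels from page 2.
doc1 = ([0, {'/S': '/r'}, 2, {'/S': '/D'}], 4)
# Document 2: 3 pages, decimal labels starting at 489.
doc2 = ([0, {'/S': '/D', '/St': 489}], 3)
print(merge_page_label_nums([doc1, doc2]))
```

Walking the array in pairs avoids having to tell page indices from label dictionaries by type, which is what the isinstance checks in the answers above are doing.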

Editing a pdf page by page

I'm trying to make a different edit to each page of a pre-existing PDF. However, every page ends up with the same edit.
I first tried FPDF (but wasn't sure how to edit a pre-existing PDF with it) and am now trying PyPDF2 with reportlab.
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
def WriteOnPdf(targetpdf, pageTopicsDict):
    packet = io.BytesIO()
    # Create a new PDF with Reportlab
    can = canvas.Canvas(packet, pagesize=letter)
    can.setFont('Helvetica', 13)
    can.drawString(5, 730, pageTopicsDict[0])
    can.save()
    # Move to the beginning of the StringIO buffer
    packet.seek(0)
    new_pdf = PdfFileReader(packet)
    # Read your existing PDF
    existing_pdf = PdfFileReader(open(targetpdf, "rb"))
    output = PdfFileWriter()
    # Add the "watermark" (which is the new pdf) on the existing page
    for i in range(existing_pdf.numPages):
        print(i, pageTopicsDict[i])
        can.drawString(5, 730, pageTopicsDict[i])
        page = existing_pdf.getPage(i)
        page.mergePage(new_pdf.getPage(0))  # index out of range if not set to 0.
        output.addPage(page)
    # Finally, write "output" to a real file
    outputStream = open("destination.pdf", "wb")
    output.write(outputStream)
    outputStream.close()
dummyDict = {0: "abc", 1: "de, fg", 2: "hijklmn"}
WriteOnPdf("test.pdf", dummyDict)
Expected: a PDF with "abc" in the top left-hand corner of page 0, "de, fg" on page 1, "hijklmn" on page 2...
Actual: all pages have "abc".

Solved: I now initialize the packet and the variables that depend on it inside the for loop instead of outside, so each page gets its own freshly built overlay.

PyPDF2 & ReportLab editing a PDF and merging multiple pages

I'm trying to add some text (page numbers) to an existing PDF file.
I'm using the PyPDF2 package to iterate through the original file, create a canvas for each page, and then merge the two. My problem is that once the program has finished, the new PDF file only contains the last page of the original PDF, not all the pages.
E.g. if the original PDF has 33 pages, the new PDF only has the last page, but with the correct numbering.
Maybe the code can do a better job at explaining:
def test(location, reference, destination):
    file = open(location, "rb")
    read_pdf = PyPDF2.PdfFileReader(file)
    for i in range(0, read_pdf.getNumPages()):
        page = read_pdf.getPage(i)
        pageReference = "%s_%s" % (reference, format(i+1, '03d'))
        width = getPageSizeW(page)
        height = getPageSizeH(page)
        pagesize = (width, height)
        packet = io.BytesIO()
        can = canvas.Canvas(packet, pagesize=pagesize)
        can.setFillColorRGB(1, 0, 0)
        can.drawString(height*3.5, height*2.75, pageReference)
        can.save()
        packet.seek(0)
        new_pdf = PyPDF2.PdfFileReader(packet)
        # add new pdf to old pdf
        output = PyPDF2.PdfFileWriter()
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
        outputStream = open(destination, 'wb')
        output.write(outputStream)
        print(pageReference)
    outputStream.close()
    file.close()
def getPageSizeH(p):
    h = float(p.mediaBox.getHeight()) * 0.352
    return h
def getPageSizeW(p):
    w = float(p.mediaBox.getWidth()) * 0.352
    return w
Also if anyone has any ideas on how to insert the references on the top right in a better way, it would be appreciated.

I'm not an expert at PyPDF2, but it looks like the only place in your function where you create PyPDF2.PdfFileWriter() is inside the for loop. I suspect you are starting a new output file on every iteration and adding a single page to it, which would explain why only the last page ends up in the result.
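On the placement question: the media box returned by PyPDF2 is expressed in PDF points (1/72 inch), which is also the default unit of a reportlab canvas, so the 0.352 factor (roughly points to millimetres) is probably not needed. Below is a sketch of computing a top-right anchor directly in points. `page_reference`, `top_right_anchor` and the 36-point margin are illustrative choices, not part of either library, and the sketch assumes the media box starts at (0, 0).

```python
def page_reference(reference, page_index):
    """Build a label such as 'REF_001' from a zero-based page index."""
    return "%s_%03d" % (reference, page_index + 1)


def top_right_anchor(width_pt, height_pt, margin_pt=36):
    """Return (x, y) in points for text that should end near the top-right corner.

    The origin is the bottom-left corner of the page, as in PDF user space.
    """
    return (width_pt - margin_pt, height_pt - margin_pt)


# US Letter is 612 x 792 points.
x, y = top_right_anchor(612, 792)
print(page_reference("REF", 0), "at", (x, y))
```

The resulting pair could be passed to the canvas's drawRightString(x, y, text), which right-aligns the text at x, with the canvas created using the unscaled page width and height as its pagesize.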
