Python Splitting PDF Leaves Meta-Text From Other Pages - python

The following code has successfully split a large PDF file into smaller PDFs of 2 pages each. However, if I look into one of the files, I see meta-text from others.
This is used to split the PDF into smaller ones:
import numpy as np
from PyPDF2 import PdfFileWriter, PdfFileReader
inputpdf = PdfFileReader(open(path+"multi.pdf", "rb"))
r=np.arange(inputpdf.numPages)
r2=[(r[i],r[i+1]) for i in range(0,len(r),2)]
for i in r2:
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i[0]))
    output.addPage(inputpdf.getPage(i[1]))
    with open(path+"document-page %s.pdf" % i[0], "wb") as outputStream:
        output.write(outputStream)
This is used to get the meta-text of one of the resulting files (PyPDF2 will not read it):
import pdfx
path=path+'document-page 8.pdf'
pdf = pdfx.PDFx(path)
pdf.get_text()
My issues with this are:
1. The process is very slow, and all I want is a 10-digit number in the upper-right corner of the first page. Can I somehow get just that part?
2. The result contains text from adjacent pages of the original PDF file (which is why I'm calling it "meta-text"). Why is that? Can this be resolved?
Update:
pdf.get_references_count()
...shows 20 (there should only be 2)
Thanks in advance!
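One possible way to approach the first issue, once the page text is available as a string (for instance from pdf.get_text() above), is to search it for the first standalone run of exactly ten digits. This is only a sketch: the name find_ten_digit_number and the sample text are hypothetical, and it does not make the text extraction itself any faster.

```python
import re

# Exactly 10 digits, not preceded or followed by another digit.
TEN_DIGITS = re.compile(r"(?<!\d)\d{10}(?!\d)")

def find_ten_digit_number(text):
    """Hypothetical helper: return the first 10-digit number in text, or None."""
    match = TEN_DIGITS.search(text)
    return match.group(0) if match else None

# Made-up sample text; the 12-digit reference is deliberately not matched.
sample_text = "Invoice 0123456789\nPage 1 of 2\nRef 123456789012"
print(find_ten_digit_number(sample_text))
```

The lookarounds keep the pattern from picking ten digits out of a longer number, which matters if the page also contains longer numeric identifiers.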

Related

How can I split a PDF file by its pages into PDF files of a smaller size with PyPDF2?

I wrote simple Python code that takes a PDF, goes over its pages using PyPDF2, and saves each page as a new PDF file.
Here is the function that saves a page:
from PyPDF2 import PdfReader, PdfWriter
def save_pdf_page(file_name, page_index):
    reader = PdfReader(file_name)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_index])
    writer.remove_links()
    with open(f"output_page{page_index}.pdf", "wb") as fh:
        writer.write(fh)
Surprisingly, each output file is about the same size as the original PDF file.
Using remove_links (taken from here) didn't reduce the file size.
I found a similar question here, saying it may be because PyPDF output files are uncompressed.
Is there a way, using PyPDF or any other Python library, to make each page relatively small, as expected?
You are running into this issue: https://github.com/py-pdf/PyPDF2/issues/449
Essentially there are two problems:
1. Every page might need a resource which is shared, e.g. font information.
2. PyPDF2 might not realize when some pages don't need it.
Removing links might help. Additionally, you might want to follow the docs to reduce file size:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("test.pdf")
writer = PdfWriter()
for page_num in [2, 3]:
    page = reader.pages[page_num]
    # This is CPU intensive! It ZIPs the contents of the page
    page.compress_content_streams()
    writer.add_page(page)
with open("seperate.pdf", "wb") as fh:
    writer.remove_links()
    writer.write(fh)
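As a rough illustration of why compressing content streams can shrink the output: page content streams are plain-text drawing operators, which tend to be repetitive and therefore compress well with Deflate, the algorithm behind Python's zlib module and PDF's FlateDecode filter. The sketch below uses only the standard library. The sample stream is made up, no real PDF is involved, and it says nothing about shared resources such as fonts, which are the other cause described above.

```python
import zlib

# Made-up content stream: repeated text-drawing operators, as on a page of text.
fake_stream = b"".join(
    b"BT /F1 12 Tf 72 %d Td (Line %d of some page text) Tj ET\n" % (720 - 14 * i, i)
    for i in range(50)
)

compressed = zlib.compress(fake_stream)

print(len(fake_stream), "bytes uncompressed")
print(len(compressed), "bytes compressed")

# Decompressing gives back the original bytes, so no content is lost.
restored = zlib.decompress(compressed)
print(restored == fake_stream)
```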

(Python) Converting Hundreds of PNGs to a Single PDF

I have a folder with 452 images (.png) that I'm trying to merge into a single PDF file using Python. Each image is labelled with its intended page number, e.g. "1.png", "2.png", ..., "452.png".
This code was technically successful, but input the pages out of the intended order.
import os
import img2pdf
from PIL import Image
with open("output.pdf", 'wb') as f:
    f.write(img2pdf.convert([i for i in os.listdir('.') if i.endswith(".png")]))
I also tried reading the data as binary data, then convert it and write it to the PDF, but this yields a 94MB one-page PDF.
import img2pdf
from PIL import Image
with open("output.pdf", 'wb') as f:
    for i in range(1, 453):
        img = Image.open(f"{i}.png")
        pdf_bytes = img2pdf.convert(img)
        f.write(pdf_bytes)
Any help would be appreciated, I've done quite a bit of research, but have come up short. Thanks in advance.
"but input the pages out of the intended order"
I suspect that the intended order is "in numerical order of file name", i.e. 1.png, 2.png, 3.png, and so forth.
This can be solved with:
with open("output.pdf", 'wb') as f:
    pngs = [i for i in os.listdir('.') if i.endswith(".png")]
    pngs.sort(key=lambda fname: int(fname.rsplit('.', 1)[0]))
    f.write(img2pdf.convert(pngs))
This is a slightly modified version of your first attempt that just sorts the file names (in the way your second attempt tries to do) before batch-writing them to the PDF.
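To see why the sort key matters: plain string sorting compares character by character, so "10.png" sorts before "2.png". A small self-contained sketch with a made-up file list (numeric_key is a name chosen for this illustration, and it assumes every name before the extension is an integer):

```python
names = ["10.png", "2.png", "1.png", "452.png", "11.png"]

def numeric_key(fname):
    # "12.png" -> 12; assumes the part before the extension is an integer
    return int(fname.rsplit(".", 1)[0])

string_order = sorted(names)
page_order = sorted(names, key=numeric_key)

print(string_order)  # ['1.png', '10.png', '11.png', '2.png', '452.png']
print(page_order)    # ['1.png', '2.png', '10.png', '11.png', '452.png']
```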

Is there a faster way to merge two files rather than page by page?

I'm on Python 3, using PyPDF2 and in order to add page numbers to a newly generated PDF (which I do using reportlab) I merge the two PDF files page by page in the following way:
from PyPDF2 import PdfFileWriter, PdfFileReader
def merge_pdf_files(first_pdf_fp, second_pdf_fp, target_fp):
    """
    Merges two PDF files into a target final PDF file.
    Args:
        first_pdf_fp: the first PDF file path.
        second_pdf_fp: the second PDF file path.
        target_fp: the target PDF file path.
    """
    pdf1 = PdfFileReader(first_pdf_fp)
    pdf2 = PdfFileReader(second_pdf_fp)
    assert (pdf1.getNumPages() == pdf2.getNumPages())
    final_pdf_writer = PdfFileWriter()
    for i in range(pdf1.getNumPages()):
        number_page = pdf1.getPage(i)
        content_page = pdf2.getPage(i)
        content_page.mergePage(number_page)
        final_pdf_writer.addPage(content_page)
    with open(target_fp, "wb") as final_os:
        final_pdf_writer.write(final_os)
But this is very slow. Is there a faster and cleaner way to merge them all at once using PyPDF2?
I do not have enough 'reputation' to comment. But since I was going to post an answer I made it long.
Normally when people want to 'merge' documents they mean 'combining' them, or as you point out, concatenating or appending one PDF at the end of the other (or somewhere in between). But based on the code you present, it seems you meant overlaying one PDF over another, right? In other words, you want page 1 from both pdf1 and pdf2 to be combined into a single page as part of a new PDF.
If so, you could use this (modified from an example used to illustrate watermarking). It still overlays one page at a time, but pdfrw is known to be much faster than PyPDF2 and is supposed to work well with reportlab. I haven't compared the speeds, so I'm not sure whether this will actually be faster than what you already have.
from pdfrw import PdfReader, PdfWriter, PageMerge
# The names are imported directly, so they are used without the "pdfrw." prefix
p1 = PdfReader("file1")
p2 = PdfReader("file2")
for page in range(len(p1.pages)):
    # Overlay page i of file2 onto page i of file1
    merger = PageMerge(p1.pages[page])
    merger.add(p2.pages[page]).render()
writer = PdfWriter()
writer.write("output.pdf", p1)
Try this.
You can use PyPDF2's PdfFileMerger class (named PdfMerger in newer PyPDF2 releases).
For file concatenation, use its append method:
from PyPDF2 import PdfFileMerger
pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']
merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(pdf)
merger.write("result.pdf")
merger.close()
Maybe the answer in "Is there a way to speed up PDF page merging...", where using multiprocessing takes 100% of the processor, will help you.

Split specific pages of PDF and save it with Python

I am trying to split a single 20-page PDF file into five PDF files: the 1st contains pages 1-3, the 2nd only page 4, the 3rd pages 5-10, the 4th pages 11-17, and the 5th pages 18-20. I need working code in Python. The code below splits the entire PDF file into single pages, but I want the pages grouped.
from PyPDF2 import PdfFileWriter, PdfFileReader
inputpdf = PdfFileReader(open("input.pdf", "rb"))
for i in range(inputpdf.numPages):
    j = i+1
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("page%s.pdf" % j, "wb") as outputStream:
        output.write(outputStream)
To me this looks like a task for pdfrw. Based on an example from GitHub, I wrote the following code:
from pdfrw import PdfReader, PdfWriter
pages = PdfReader('inputfile.pdf').pages
parts = [(3,6),(7,10)]
for part in parts:
    outdata = PdfWriter(f'pages_{part[0]}_{part[1]}.pdf')
    for pagenum in range(*part):
        outdata.addpage(pages[pagenum-1])
    outdata.write()
This creates two files, pages_3_6.pdf and pages_7_10.pdf, each with 3 pages (3, 4, 5 and 7, 8, 9), because range excludes its end value. Note the pagenum-1 in the code: the -1 is needed because PDF page numbering starts at 1, while list indices start at 0. I also used so-called f-strings for the names of the output files. In my opinion they are a slick method, but they are not available in Python 2 or in Python 3 versions before 3.6 (I tested my code in 3.6.7), so you might use the old formatting method instead if you wish.
Remember to alter filenames and ranges accordingly to your needs.
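For the exact grouping asked about in the question (pages 1-3, 4, 5-10, 11-17, 18-20), the page-number arithmetic can be sketched separately from any PDF library. The helper name groups_to_indices is hypothetical; it turns 1-based inclusive page ranges into the 0-based indices that a list of pages expects.

```python
def groups_to_indices(groups):
    """Convert 1-based inclusive (first, last) page ranges to 0-based index lists."""
    return [list(range(first - 1, last)) for first, last in groups]

# Grouping from the question: 1-3, 4, 5-10, 11-17, 18-20
groups = [(1, 3), (4, 4), (5, 10), (11, 17), (18, 20)]
index_lists = groups_to_indices(groups)

for (first, last), indices in zip(groups, index_lists):
    print(f"pages_{first}_{last}.pdf <- indices {indices}")
```

Each index list could then drive the inner loop of the code above, for example outdata.addpage(pages[index]) with no further -1 adjustment.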
If you have Python 3, you can use tika, according to this answer:
How to extract text from a PDF file?
How to extract specific pages (or split specific pages) from a PDF file and save those pages as a separate PDF using Python.
pip install PyPDF2 # to install module/package
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_file_path = 'Unknown.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
pdf = PdfFileReader(pdf_file_path)
pages = [0, 2, 4] # page 1, 3, 5
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
# The with block closes the file, so no explicit f.close() is needed
with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    pdfWriter.write(f)
CREDIT :
How to extract PDF pages and save as a separate PDF file using Python

PyPDF2 Create a single page by appending two or more pages one after another using Python

I am trying to find a way to append two or more pages one after another to create a single page. I tried the following solution; it creates a single-page PDF, but page 2 overlaps page 1.
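from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter
file1 = PdfFileReader(open('good_old.pdf', "rb"))
output = PdfFileWriter()
page = file1.getPage(0)
# mergePage draws page 2 on top of page 1 at the same position
page.mergePage(file1.getPage(1))
# Add the merged page to the writer so the output is not empty
output.addPage(page)
outputStream = open("great_new.pdf", "wb")
output.write(outputStream)
outputStream.close()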
I just want the two pages to appear one after another but as a single page: if the height of each page is x, the output PDF should have a single page of height 2x.
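The geometry of the desired result can be worked out independently of the library. A minimal sketch, assuming PDF coordinates with the origin at the bottom-left corner and page sizes given in points: the combined page is as wide as the widest page and as tall as the sum of the heights, and each page is shifted up so that the first one ends at the top. The function name stack_layout is hypothetical, and actually drawing the pages at these offsets is left to whichever library is used.

```python
def stack_layout(page_sizes):
    """page_sizes: list of (width, height) in reading order, first page on top.

    Returns (total_width, total_height, y_offsets), where y_offsets[i] is the
    vertical shift for page i, measured from the bottom of the combined page.
    """
    total_width = max(w for w, _ in page_sizes)
    total_height = sum(h for _, h in page_sizes)
    y_offsets = []
    remaining = total_height
    for _, height in page_sizes:
        remaining -= height          # space left below this page
        y_offsets.append(remaining)
    return total_width, total_height, y_offsets

# Two US Letter pages (612 x 792 points): the first is shifted up by 792,
# the second stays at the bottom.
print(stack_layout([(612, 792), (612, 792)]))  # (612, 1584, [792, 0])
```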
