Split specific pages of a PDF and save them with Python

I am trying to split a single 20-page PDF file into five separate PDF files: the 1st PDF contains pages 1-3, the 2nd contains only page 4, the 3rd contains pages 5-10, the 4th contains pages 11-17, and the 5th contains pages 18-20. I need working code in Python. The code below splits the entire PDF file into single pages, but I want the grouped pages.
from PyPDF2 import PdfFileWriter, PdfFileReader
inputpdf = PdfFileReader(open("input.pdf", "rb"))
for i in range(inputpdf.numPages):
    j = i + 1
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("page%s.pdf" % j, "wb") as outputStream:
        output.write(outputStream)

This looks like a task for pdfrw. Using this example from GitHub, I wrote the following example code:
from pdfrw import PdfReader, PdfWriter
pages = PdfReader('inputfile.pdf').pages
parts = [(3,6),(7,10)]
for part in parts:
    outdata = PdfWriter(f'pages_{part[0]}_{part[1]}.pdf')
    for pagenum in range(*part):
        outdata.addpage(pages[pagenum-1])
    outdata.write()
This creates two files, pages_3_6.pdf and pages_7_10.pdf, each with 3 pages, i.e. pages 3, 4, 5 and 7, 8, 9. Note the pagenum-1 in the code; that -1 is needed because PDF page numbering starts at 1 rather than 0. I also used so-called f-strings to build the names of the output files. In my opinion it is a slick method, but it is not available in Python 2 and I am not sure if it is available in all Python 3 versions (I tested my code in 3.6.7), so you might use the old formatting method instead if you wish.
Remember to alter the filenames and ranges according to your needs.
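If I read the range(*part) logic correctly, the parts list for the grouping asked about in the question (pages 1-3, 4, 5-10, 11-17, 18-20) would be as follows (each end value is one past the last page, because range stops short of it):
# end values are exclusive because range(*part) stops one short
parts = [(1, 4), (4, 5), (5, 11), (11, 18), (18, 21)]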

If you have Python 3, you can use tika, as described in the following answer:
How to extract text from a PDF file?
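For reference, a minimal sketch of what the tika route looks like (this extracts text rather than splitting pages; the file name is a placeholder, and the tika package starts a Java-based Tika server under the hood):
from tika import parser  # pip install tika (requires Java)

parsed = parser.from_file("input.pdf")
print(parsed["content"])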

How to extract specific pages (or split specific pages) from a PDF file and save those pages as a separate PDF using Python.
pip install PyPDF2 # to install module/package
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_file_path = 'Unknown.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
pdf = PdfFileReader(pdf_file_path)
pages = [0, 2, 4] # page 1, 3, 5
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))

with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    pdfWriter.write(f)
Credit: How to extract PDF pages and save as a separate PDF file using Python

Related

How can I split a PDF file by its pages into PDF files of a smaller size with PyPDF2?

I wrote simple Python code that takes a PDF, goes over its pages using PyPDF2, and saves each page as a new PDF file.
See the page-save function here:
from PyPDF2 import PdfReader, PdfWriter
def save_pdf_page(file_name, page_index):
    reader = PdfReader(file_name)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_index])
    writer.remove_links()
    with open(f"output_page{page_index}.pdf", "wb") as fh:
        writer.write(fh)
Surprisingly, each page is about the same size as the original PDF file.
Using remove_links (taken from here) didn't reduce the page size.
I found a similar question here, saying it may be because PyPDF output files are uncompressed.
Is there a way, using PyPDF or any other Python lib, to make each page relatively small, as expected?
You are running into this issue: https://github.com/py-pdf/PyPDF2/issues/449
Essentially, there are two problems:
Every page might need a resource which is shared, e.g. font information
PyPDF2 might not realize when some pages don't need it
Removing links might help. Additionally, you might want to follow the docs to reduce the file size:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("test.pdf")
writer = PdfWriter()
for page_num in [2, 3]:
    page = reader.pages[page_num]
    # This is CPU intensive! It ZIPs the contents of the page
    page.compress_content_streams()
    writer.add_page(page)

with open("separate.pdf", "wb") as fh:
    writer.remove_links()
    writer.write(fh)

How can I remove every other page of a pdf with python?

I downloaded a pdf where every other page is blank and I'd like to remove the blank pages. I could do this manually in a pdf tool (Adobe Acrobat, Preview.app, PDFPen, etc.) but since it's several hundred pages I'd like to do something more automated. Is there a way to do this in python?
One way is to use PyPDF4, so in your terminal first do
pip install pypdf4
Then create a .py script file similar to this:
# pdf_strip_every_other_page.py
from PyPDF4 import PdfFileReader, PdfFileWriter
number_of_pages = 500
output_writer = PdfFileWriter()
with open("/path/to/original.pdf", "rb") as inputfile:
pdfOne = PdfFileReader(inputfile)
for i in list(range(0, number_of_pages)):
if i % 2 == 0:
page = pdfOne.getPage(i)
output_writer.addPage(page)
with open("/path/to/output.pdf", "wb") as outfile:
output_writer.write(outfile)
Note: you'll need to change the paths to what's appropriate for your scenario.
Obviously this script is rather crude and could be improved, but I wanted to share it for anyone else wanting a quick way to deal with this scenario.
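A variant that avoids hard-coding number_of_pages, sketched with the newer pypdf API (untested here; the paths are placeholders):
from pypdf import PdfReader, PdfWriter

reader = PdfReader("/path/to/original.pdf")
writer = PdfWriter()

# keep pages 1, 3, 5, ... (0-based indices 0, 2, 4, ...)
for index, page in enumerate(reader.pages):
    if index % 2 == 0:
        writer.add_page(page)

with open("/path/to/output.pdf", "wb") as outfile:
    writer.write(outfile)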

Python Splitting PDF Leaves Meta-Text From Other Pages

The following code has successfully split a large PDF file into smaller PDFs of 2 pages each. However, if I look into one of the files, I see meta-text from others.
This is used to split the PDF into smaller ones:
import numpy as np
from PyPDF2 import PdfFileWriter, PdfFileReader
inputpdf = PdfFileReader(open(path+"multi.pdf", "rb"))
r=np.arange(inputpdf.numPages)
r2=[(r[i],r[i+1]) for i in range(0,len(r),2)]
for i in r2:
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i[0]))
    output.addPage(inputpdf.getPage(i[1]))
    with open(path+"document-page %s.pdf" % i[0], "wb") as outputStream:
        output.write(outputStream)
This is used to get the meta-text of one of the resulting files (PyPDF2 will not read it):
import pdfx
path=path+'document-page 8.pdf'
pdf = pdfx.PDFx(path)
pdf.get_text()
My issues with this are:
The process is super slow and all I want is a 10 digit number in the upper-right corner of the first page. Can I somehow just get that part?
When looking at the result, it has text from other adjacent pages from the original PDF file (which is why I'm calling it "meta-text"). Why is that? Can this be resolved?
Update:
pdf.get_references_count()
...shows 20 (there should only be 2)
Thanks in advance!
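No answer is quoted here, but for the first point (reading just the number from page 1), a minimal sketch with the newer pypdf API might look like this, assuming the 10-digit number is selectable text rather than part of a scanned image:
import re
from pypdf import PdfReader

reader = PdfReader("multi.pdf")                     # placeholder file name
first_page_text = reader.pages[0].extract_text()    # text of page 1 only
match = re.search(r"\b\d{10}\b", first_page_text)   # look for a 10-digit number
if match:
    print(match.group(0))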

PyPDF2 Create a single page by appending two or more pages one after another using Python

I am trying to find a way to append two or more pages one after another and create a single page. I tried the following solution; it creates a single-page PDF, however page 2 overlaps page 1.
from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter

file1 = PdfFileReader(open('good_old.pdf', "rb"))
output = PdfFileWriter()

page = file1.getPage(0)
page.mergePage(file1.getPage(1))
output.addPage(page)  # add the merged page to the writer

outputStream = open("great_new.pdf", "wb")
output.write(outputStream)
outputStream.close()
I just want the two pages to appear one after another but as a single page; if the height of each page is x, the output PDF should have a single page with height 2x.
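No answer is quoted here, but one common approach is to create a blank page with the combined height and merge both pages onto it with a vertical offset. A minimal sketch with the newer pypdf API (untested; file names follow the question, and both pages are assumed to have roughly the same width):
from pypdf import PdfReader, PdfWriter, PageObject, Transformation

reader = PdfReader("good_old.pdf")
p1, p2 = reader.pages[0], reader.pages[1]

width = max(float(p1.mediabox.width), float(p2.mediabox.width))
height = float(p1.mediabox.height) + float(p2.mediabox.height)

# blank canvas tall enough to hold both pages stacked vertically
combined = PageObject.create_blank_page(width=width, height=height)
combined.merge_page(p2)  # page 2 sits at the bottom of the canvas
combined.merge_transformed_page(
    p1, Transformation().translate(0, float(p2.mediabox.height))
)  # page 1 is shifted up by page 2's height

writer = PdfWriter()
writer.add_page(combined)
with open("great_new.pdf", "wb") as fh:
    writer.write(fh)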

Combine two lists of PDFs one to one using Python

I have created a series of PDF documents (maps) using data driven pages in ESRI ArcMap 10. There is a page 1 and page 2 for each map generated from separate *.mxd. So I have one list of PDF documents containing page 1 for each map and one list of PDF documents containing page 2 for each map. For example: Map1_001.pdf, map1_002.pdf, map1_003.pdf...map2_001.pdf, map2_002.pdf, map2_003.pdf...and so one.
I would like to append these maps, pages 1 and 2, together so that both page 1 and 2 are together in one PDF per map. For example: mapboth_001.pdf, mapboth_002.pdf, mapboth_003.pdf... (they don't have to go into a new pdf file (mapboth), it's fine to append them to map1)
For each map1_*.pdf
Walk through the directory and append the map2_*.pdf where the numbers (where the * is) in the file names match
There must be a way to do it using Python, maybe with a combination of arcpy, os.walk or os.listdir, pyPdf, and a for loop?
for pdf in os.walk(datadirectory):
??
Any ideas? Thanks kindly for your help.
A PDF file is structured in a different way than a plain text file. Simply putting two PDF files together wouldn't work, as the file's structure and contents could be overwritten or become corrupt. You could certainly author your own, but that would take a fair amount of time, and intimate knowledge of how a PDF is internally structured.
That said, I would recommend that you look into pyPDF. It supports the merging feature that you're looking for.
This should properly find and collate all the files to be merged; it still needs the actual .pdf-merging code.
Edit: I have added pdf-writing code based on the pyPdf example code. It is not tested, but should (as nearly as I can tell) work properly.
Edit2: realized I had the map-numbering crossways; rejigged it to merge the right sets of maps.
import collections
import glob
import re

# probably need to install this module -
# pip install pyPdf
from pyPdf import PdfFileWriter, PdfFileReader

def group_matched_files(filespec, reg, keyFn, dataFn):
    res = collections.defaultdict(list)
    reg = re.compile(reg)
    for fname in glob.glob(filespec):
        data = reg.match(fname)
        if data is not None:
            res[keyFn(data)].append(dataFn(data))
    return res

def merge_pdfs(fnames, newname):
    print("Merging {} to {}".format(",".join(fnames), newname))
    # create new output pdf
    newpdf = PdfFileWriter()
    # for each file to merge
    for fname in fnames:
        with open(fname, "rb") as inf:
            oldpdf = PdfFileReader(inf)
            # for each page in the file
            for pg in range(oldpdf.getNumPages()):
                # copy it to the output file
                newpdf.addPage(oldpdf.getPage(pg))
    # write finished output
    with open(newname, "wb") as outf:
        newpdf.write(outf)

def main():
    matches = group_matched_files(
        "map*.pdf",
        r"map(\d+)_(\d+)\.pdf$",
        lambda d: "{}".format(d.group(2)),
        lambda d: "map{}_".format(d.group(1))
    )
    for map, pages in matches.iteritems():
        # pass a list (not a generator) so the names can be both printed and iterated
        merge_pdfs([page + map + '.pdf' for page in sorted(pages)], "merged{}.pdf".format(map))

if __name__ == "__main__":
    main()
I don't have any test PDFs to try and combine, but I tested with a cat command on text files.
You can try this out (I'm assuming a Unix-based system): merge.py
import os, re
files = os.listdir("/home/user/directory_with_maps/")
files = [x for x in files if re.search("map1_", x)]
while len(files) > 0:
    current = files[0]
    search = re.search("_(\d+).pdf", current)
    if search:
        name = search.group(1)
        cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=FULLMAP_%s.pdf %s map2_%s.pdf" % (name, current, name)
        os.system(cmd)
    files.remove(current)
Basically it grabs the map1_ list, then for each file assumes the matching map2_ file exists and merges the pair by number. (You could also use a counter and pad with zeros to get a similar effect.)
Test the gs command first though, I just grabbed it from http://hints.macworld.com/article.php?story=2003083122212228.
There are examples of how to do this on the pdfrw project page at Google Code:
http://code.google.com/p/pdfrw/wiki/ExampleTools
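For reference, the core of the pdfrw approach is just reading each input's pages and adding them all to one writer. A minimal sketch for one matched pair (file names are placeholders):
from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for path in ("map1_001.pdf", "map2_001.pdf"):  # one page-1 file and its page-2 partner
    writer.addpages(PdfReader(path).pages)
writer.write("mapboth_001.pdf")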
