How can I remove every other page of a pdf with python? - python

I downloaded a pdf where every other page is blank and I'd like to remove the blank pages. I could do this manually in a pdf tool (Adobe Acrobat, Preview.app, PDFPen, etc.) but since it's several hundred pages I'd like to do something more automated. Is there a way to do this in python?

One way is to use pypdf, so in your terminal first do
pip install pypdf4
Then create a .py script file similar to this:
# pdf_strip_every_other_page.py
from PyPDF4 import PdfFileReader, PdfFileWriter

output_writer = PdfFileWriter()
with open("/path/to/original.pdf", "rb") as inputfile:
    pdfOne = PdfFileReader(inputfile)
    # Keep pages 0, 2, 4, ... and skip every other page
    for i in range(0, pdfOne.getNumPages(), 2):
        page = pdfOne.getPage(i)
        output_writer.addPage(page)
    # Write while the input file is still open: page data is read lazily
    with open("/path/to/output.pdf", "wb") as outfile:
        output_writer.write(outfile)
Note: you'll need to change the paths to whatever is appropriate for your scenario.
This script is rather crude and could be improved, but I wanted to share it for anyone else who wants a quick way to deal with this scenario.
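One possible refinement, sketched under assumptions: wrap the logic in a function so the choice of which pages to keep is explicit, and take the page count from the file itself. The names pages_to_keep and strip_pages and the start/step parameters are illustrative and not part of PyPDF4; only PdfFileReader, PdfFileWriter, getNumPages, getPage and addPage come from the library.

```python
def pages_to_keep(total_pages, start=0, step=2):
    """Return the zero-based indices of the pages to keep.

    start=0, step=2 keeps pages 0, 2, 4, ... and drops the rest.
    """
    return list(range(start, total_pages, step))


def strip_pages(src_path, dst_path, start=0, step=2):
    # Imported here so the index helper above works without PyPDF4 installed
    from PyPDF4 import PdfFileReader, PdfFileWriter

    writer = PdfFileWriter()
    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        for i in pages_to_keep(reader.getNumPages(), start, step):
            writer.addPage(reader.getPage(i))
        # Write before the input closes: page data is read lazily
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

If the blank pages turn out to be the ones at indices 0, 2, 4, ... instead, calling strip_pages(src, dst, start=1) keeps pages 1, 3, 5, ... .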

Related

PyPDF2 doesn't display barcode (pdf417) after using update_page_form_field_values()

I'm trying to modify a pdf found here
https://www.uscis.gov/sites/default/files/document/forms/i-130.pdf
It has a barcode at the bottom that I need to keep. However if I use the update_page_form_field_values() function the pdf doesn't show the barcode. How do I prevent this?
from PyPDF2 import PdfFileReader,PdfFileWriter
reader = PdfFileReader("i-130.pdf")
pager = reader.getPage(0)
field_dict = {
    "Pt2Line4b_GivenName[0]": "Mark"
}
writer = PdfFileWriter()
writer.add_page(pager)
writer.update_page_form_field_values(pager,fields=field_dict)
with open("newfile.pdf", "wb") as fh:
    writer.write(fh)
I've tried modifying the basic fields by accessing the forms directly, but I have issues with all the forms showing up. Run this code snippet separately; note that you can't update the GivenName field directly.
from PyPDF2 import PdfFileReader,PdfFileWriter
from PyPDF2.generic import NameObject, IndirectObject, BooleanObject
reader = PdfFileReader("i-130.pdf")
pager = reader.getPage(0)
annot3 = pager['/Annots'][18].get_object()
annot3.update({NameObject("/V"):NameObject("Mark")})
writer = PdfFileWriter()
writer._root_object.update({NameObject("/AcroForm"):IndirectObject(len(writer._objects),0,writer)})
need_appearances = NameObject("/NeedAppearances")
writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
with open("newfile.pdf", "wb") as fh:
    writer.write(fh)
Unfortunately, I can't really give you any details on why this is so and what negative effects might result from approaching it this way, BUT...
The issue is driven by the self.set_need_appearances_writer() call at the top of the update_page_form_field_values function definition. To test, I commented that line out in my PyPDF2 distribution (see how-to below, but be advised, this is NOT a good long term solution), and the following code:
from PyPDF2 import PdfFileReader, PdfFileWriter
rdr = PdfFileReader("i-130.pdf")
fields_dict = {
    "Pt2Line4b_GivenName[0]": "Mark"
}
wtr = PdfFileWriter()
wtr.append_pages_from_reader(rdr)
# wtr.set_need_appearances_writer()
wtr.update_page_form_field_values(wtr.pages[0], fields_dict)
with open("i-130-prefilled.pdf", "wb") as ofile:
    wtr.write(ofile)
produces a PDF with the field filled in and the barcode still displayed.
The bad news is that the resulting PDF is quite a bit smaller than the original (~150 KB smaller), so even if it looks like an identical copy with a few fields filled out, something is clearly missing. One of those "things" is the font in which the field values are rendered: the original rendered entered values in a fixed-width font like Courier New, but the modified version displays a standard sans-serif font.
Alternatively, you can replace the append_pages_from_reader call with clone_document_from_reader, in which case you'll end up with a prepopulated version that DOES preserve the font, but the file is ~300 KB larger than the original. No matter how you slice it, something is "not the same".
Bottom line:
I'm not sure I'd go with any solution without understanding what happens "under the hood" with that set_need_appearances_writer call. As such, I'd advise direct engagement with the good folks who picked up the baton and started actively working on the PyPDF2 library again earlier this year. You can find them at https://github.com/py-pdf/PyPDF2.
As promised, here's how you can test the solution in your own environment (but please don't do this and call it a day ;)):
1) Navigate to the dist folder for your PyPDF2 install and open _writer.py (in VS Code, you can right-click on the call to update_page_form_field_values and select 'Go to Definition').
2) Comment out the self.set_need_appearances_writer() line at the top of the update_page_form_field_values definition.
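A less invasive way to run the same experiment, offered as a sketch and not as a fix: subclass the writer and override set_need_appearances_writer with a no-op, so the installed PyPDF2 files stay untouched. The helper names without_need_appearances and fill_form are hypothetical; the PyPDF2 calls are the ones already used in this answer, and the same caveats about fonts and file size still apply.

```python
def without_need_appearances(writer_cls):
    """Return a subclass of writer_cls that skips set_need_appearances_writer()."""

    class QuietWriter(writer_cls):
        def set_need_appearances_writer(self, *args, **kwargs):
            # Deliberately do nothing, mirroring the commented-out line
            return None

    return QuietWriter


def fill_form(src_path, dst_path, fields):
    # Imported lazily; assumes a PyPDF2 version with the API used in this answer
    from PyPDF2 import PdfFileReader, PdfFileWriter

    writer = without_need_appearances(PdfFileWriter)()
    writer.append_pages_from_reader(PdfFileReader(src_path))
    writer.update_page_form_field_values(writer.pages[0], fields)
    with open(dst_path, "wb") as out:
        writer.write(out)
```

A call would look like fill_form("i-130.pdf", "i-130-prefilled.pdf", {"Pt2Line4b_GivenName[0]": "Mark"}).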

How can I split a PDF file by its pages into PDF files of a smaller size with PyPDF2?

I wrote some simple Python code that takes a PDF, goes over its pages using PyPDF2, and saves each page as a new PDF file.
The page-saving function is:
from PyPDF2 import PdfReader, PdfWriter

def save_pdf_page(file_name, page_index):
    reader = PdfReader(file_name)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_index])
    writer.remove_links()
    with open(f"output_page{page_index}.pdf", "wb") as fh:
        writer.write(fh)
Surprisingly, each output file is about the same size as the original PDF file.
Using removeLinks (taken from here) didn't reduce the file size.
I found a similar question here, saying it may be because PyPDF output files are uncompressed.
Is there a way using PyPDF or any other python lib to make each page relatively small as expected?
You are running into this issue: https://github.com/py-pdf/PyPDF2/issues/449
Essentially there are two problems:
1) Every page might need a resource which is shared, e.g. font information
2) PyPDF2 might not realize if some pages don't need it
Removing links might help. Additionally, you might want to follow the docs to reduce file size:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("test.pdf")
writer = PdfWriter()
for page_num in [2, 3]:
    page = reader.pages[page_num]
    # This is CPU intensive! It ZIPs the contents of the page
    page.compress_content_streams()
    writer.add_page(page)
with open("seperate.pdf", "wb") as fh:
    writer.remove_links()
    writer.write(fh)

pyPdf PdfFileReader vs PdfFileWriter

I have the following code:
import os
from pyPdf import PdfFileReader, PdfFileWriter
path = "C:/Real Python/Course materials/Chapter 12/Practice files"
input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
output_PDF = PdfFileWriter()
for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))
output_file_name = os.path.join(path, "Output/portion.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()
Until now I was only reading from PDFs, and later I learned to write from PDF to txt... but now this...
Why does PdfFileReader differ so much from PdfFileWriter?
Can someone explain this? I would expect something like:
import os
from pyPdf import PdfFileReader, PdfFileWriter
path = "C:/Real Python/Course materials/Chapter 12/Practice files"
input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
output_file_name = os.path.join(path, "out Pride and Prejudice.pdf")
output_file = PdfFileWriter(file(output_file_name, "wb"))
for page_num in range(1,4):
    page = input_file.getPage(page_num)
    output_file.addPage(page_num)
    output_file.write(page)
Any help???
Thanks
EDIT 0: What does .addPage() do?
for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))
Does it just create 3 BLANK pages?
EDIT 1: Can someone explain what happens when:
1) output_PDF = PdfFileWriter()
2) output_PDF.addPage(input_file.getPage(page_num))
3) output_PDF.write(output_file)
The 3rd one passes a JUST CREATED(!) object to output_PDF, why?
The issue is basically the PDF Cross-Reference table.
It's a somewhat tangled spaghetti monster of references to pages, fonts, objects, elements, and these all need to link together to allow for random access.
Each time a file is updated, this table needs to be rebuilt. The file is created in memory first so that this only has to happen once, which also decreases the chances of torching your file.
output_PDF = PdfFileWriter()
This creates the in-memory object that the new PDF will be built in (its pages will be pulled from your old PDF).
output_PDF.addPage(input_file.getPage(page_num))
This adds the page you want from your input PDF to the PDF being built in memory.
output_PDF.write(output_file)
Finally, this writes the object stored in memory to a file, building the header, cross-reference table, and linking everything together all hunky dunky.
Edit: Presumably, the JUST CREATED flag signals PyPDF to start building the appropriate tables and link things together.
--
in response to the why vs .txt and csv:
When you're copying from a text or CSV file, there are no existing data structures to comprehend and move to make sure things like formatting, image placement, and form data (input sections, etc.) are preserved and created properly.
Most likely, it's done this way because PDFs aren't exactly linear: the cross-reference table and trailer, which serve as the file's index, are actually at the end of the file.
If the file were written to disk every time a change was made, your computer would need to keep pushing that data around on the disk. Instead, the module (probably) stores the information about the document in an object (PdfFileWriter), and then converts that data into your actual PDF file when you request it.
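To make the "build in memory, write once" idea concrete, here is a toy sketch. It is not PyPDF's implementation and does not produce a valid PDF; ToyWriter and its file layout are invented purely for illustration. It only shows why an offset table naturally ends up last: the position of each object is known only after everything before it has been written.

```python
import io


class ToyWriter:
    """Collects objects in memory, then writes them plus an offset table."""

    def __init__(self):
        self._objects = []  # nothing touches the output yet

    def add_object(self, data):
        self._objects.append(data)

    def write(self, stream):
        offsets = []
        position = 0
        for data in self._objects:
            offsets.append(position)  # where this object starts
            stream.write(data + b"\n")
            position += len(data) + 1
        table_start = position
        # The table comes last because it needs every object's final offset
        for offset in offsets:
            stream.write(b"%010d\n" % offset)
        stream.write(b"tablestart %d\n" % table_start)


buf = io.BytesIO()
writer = ToyWriter()
writer.add_object(b"page one")
writer.add_object(b"page two!")
writer.write(buf)
```

A reader of this toy format would start at the end, find the table position, and then jump straight to any object, which is the kind of random access the answer describes.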

how to get a content from a pdf file and store it in a txt file

import pyPdf
f= open('jayabal_appt.pdf','rb')
pdfl = pyPdf.PdfFileReader(f)
content=""
for i in range(0,1):
    content += pdfl.getPage(i).extractText() + "\n"
outpu = open('b.txt','wb')
outpu.write(content)
f.close()
outpu.close()
This is not getting the content from a pdf file and storing it in a txt file... What is the mistake in this code????
A simple example from the author suggests doing this (you don't seem to be using 'file'):
from pyPdf import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input1 = PdfFileReader(file("jayabal_appt.pdf", "rb"))
Then you can do the following:
output.addPage(input1.getPage(0))
And sure, use a for loop for it, but the author doesn't suggest using extractText.
Just check out the website, the example is rather straight forward: https://pypi.org/project/pypdf/
However
pyPdf is no longer maintained, so I don't recommend using it. The author suggests checking out PyPDF2 instead.
A simple Google search also suggests that you should try pdftotext or pdfminer. There are plenty of examples out there.
Good luck.
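For completeness, a minimal sketch of the original goal (all pages into a text file) using PyPDF2. It assumes a PyPDF2 release that provides PdfReader and page.extract_text(); join_page_texts and pdf_to_text are illustrative names. Extraction quality depends heavily on how the PDF was produced, so the result may be incomplete, or empty for scanned documents.

```python
def join_page_texts(page_texts):
    """Join per-page strings with newlines, treating None as empty."""
    return "\n".join(text or "" for text in page_texts)


def pdf_to_text(pdf_path, txt_path):
    # Imported lazily so the helper above runs without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Iterate over every page, not just range(0, 1)
    content = join_page_texts(page.extract_text() for page in reader.pages)
    # Text mode with an explicit encoding, since content is a str
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(content)
```

For the file in the question this would be called as pdf_to_text("jayabal_appt.pdf", "b.txt").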

Combine two lists of PDFs one to one using Python

I have created a series of PDF documents (maps) using data driven pages in ESRI ArcMap 10. There is a page 1 and page 2 for each map generated from separate *.mxd. So I have one list of PDF documents containing page 1 for each map and one list of PDF documents containing page 2 for each map. For example: Map1_001.pdf, map1_002.pdf, map1_003.pdf...map2_001.pdf, map2_002.pdf, map2_003.pdf...and so on.
I would like to append these maps, pages 1 and 2, together so that both page 1 and 2 are together in one PDF per map. For example: mapboth_001.pdf, mapboth_002.pdf, mapboth_003.pdf... (they don't have to go into a new pdf file (mapboth), it's fine to append them to map1)
For each map1_ *.pdf
Walk through the directory and append map2_ *.pdf where the numbers (where the * is) in the file name match
There must be a way to do it using python. Maybe with a combination of arcpy, os.walk or os.listdir, and pyPdf and a for loop?
for pdf in os.walk(datadirectory):
    ??
Any ideas? Thanks kindly for your help.
A PDF file is structured in a different way than a plain text file. Simply putting two PDF files together wouldn't work, as the file's structure and contents could be overwritten or become corrupt. You could certainly author your own, but that would take a fair amount of time, and intimate knowledge of how a PDF is internally structured.
That said, I would recommend that you look into pyPDF. It supports the merging feature that you're looking for.
This should properly find and collate all the files to be merged; it still needs the actual .pdf-merging code.
Edit: I have added pdf-writing code based on the pyPdf example code. It is not tested, but should (as nearly as I can tell) work properly.
Edit2: realized I had the map-numbering crossways; rejigged it to merge the right sets of maps.
import collections
import glob
import re

# probably need to install this module -
#   pip install pyPdf
from pyPdf import PdfFileWriter, PdfFileReader

def group_matched_files(filespec, reg, keyFn, dataFn):
    res = collections.defaultdict(list)
    reg = re.compile(reg)
    for fname in glob.glob(filespec):
        data = reg.match(fname)
        if data is not None:
            res[keyFn(data)].append(dataFn(data))
    return res

def merge_pdfs(fnames, newname):
    print("Merging {} to {}".format(",".join(fnames), newname))
    # create new output pdf
    newpdf = PdfFileWriter()
    # input files must stay open until the output has been written
    infiles = [open(fname, "rb") for fname in fnames]
    try:
        # for each file to merge
        for inf in infiles:
            oldpdf = PdfFileReader(inf)
            # for each page in the file
            for pg in range(oldpdf.getNumPages()):
                # copy it to the output file
                newpdf.addPage(oldpdf.getPage(pg))
        # write finished output
        with open(newname, "wb") as outf:
            newpdf.write(outf)
    finally:
        for inf in infiles:
            inf.close()

def main():
    matches = group_matched_files(
        "map*.pdf",
        r"map(\d+)_(\d+)\.pdf$",
        lambda d: "{}".format(d.group(2)),
        lambda d: "map{}_".format(d.group(1))
    )
    for num, prefixes in matches.items():
        # build a list, not a generator: merge_pdfs iterates over it twice
        fnames = [prefix + num + ".pdf" for prefix in sorted(prefixes)]
        merge_pdfs(fnames, "merged{}.pdf".format(num))

if __name__ == "__main__":
    main()
I don't have any test pdfs to try and combine but I tested with a cat command on text files.
You can try this out (I'm assuming unix based system): merge.py
import os, re

files = os.listdir("/home/user/directory_with_maps/")
files = [x for x in files if re.search("map1_", x)]
while len(files) > 0:
    current = files[0]
    # raw string, with the dot escaped so it only matches ".pdf"
    search = re.search(r"_(\d+)\.pdf", current)
    if search:
        name = search.group(1)
        cmd = "gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=FULLMAP_%s.pdf %s map2_%s.pdf" % (name, current, name)
        os.system(cmd)
    files.remove(current)
Basically it grabs the list of map1 files, then for each one assumes a matching map2 file with the same number exists and merges the two. (I can see using a counter to do this and padding with 0's to get a similar effect.)
Test the gs command first though, I just grabbed it from http://hints.macworld.com/article.php?story=2003083122212228.
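The same Ghostscript call can also be built as an argument list and run without a shell, which avoids quoting problems if a file name contains spaces. This is only a sketch: gs_merge_command and merge_pair are hypothetical helpers, and the flags are simply the ones from the command above, so try it on copies first.

```python
import subprocess


def gs_merge_command(output_name, input_names):
    """Build the gs argument list for merging input_names into output_name."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        "-sOutputFile=" + output_name,
    ] + list(input_names)


def merge_pair(number):
    # Assumes map1_<number>.pdf and map2_<number>.pdf are in the current directory
    cmd = gs_merge_command(
        "FULLMAP_%s.pdf" % number,
        ["map1_%s.pdf" % number, "map2_%s.pdf" % number],
    )
    subprocess.run(cmd, check=True)
```

With check=True, a failing gs run raises an exception instead of being silently ignored, which os.system would not do.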
There are examples of how to do this on the pdfrw project page at googlecode:
http://code.google.com/p/pdfrw/wiki/ExampleTools
