I have the following code:
import os
from pyPdf import PdfFileReader, PdfFileWriter
path = "C:/Real Python/Course materials/Chapter 12/Practice files"
input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
output_PDF = PdfFileWriter()
for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))
output_file_name = os.path.join(path, "Output/portion.pdf")
output_file = file(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()
Until now I was only reading from PDFs, and later learned to write from a PDF to a .txt file... But now this...
Why does PdfFileReader differ so much from PdfFileWriter?
Can someone explain this? I would expect something like:
import os
from pyPdf import PdfFileReader, PdfFileWriter
path = "C:/Real Python/Course materials/Chapter 12/Practice files"
input_file_name = os.path.join(path, "Pride and Prejudice.pdf")
input_file = PdfFileReader(file(input_file_name, "rb"))
output_file_name = os.path.join(path, "out Pride and Prejudice.pdf")
output_file = PdfFileWriter(file(output_file_name, "wb"))
for page_num in range(1, 4):
    page = input_file.getPage(page_num)
    output_file.addPage(page_num)
    output_file.write(page)
Any help???
Thanks
EDIT 0: What does .addPage() do?
for page_num in range(1, 4):
    output_PDF.addPage(input_file.getPage(page_num))
Does it just create 3 BLANK pages?
EDIT 1: Can someone explain what happens when:
1) output_PDF = PdfFileWriter()
2) output_PDF.addPage(input_file.getPage(page_num))
3) output_PDF.write(output_file)
The 3rd one passes a JUST CREATED(!) object to output_PDF. Why?
The issue is basically the PDF Cross-Reference table.
It's a somewhat tangled spaghetti monster of references to pages, fonts, objects, elements, and these all need to link together to allow for random access.
Each time the file is updated, this table has to be rebuilt. The file is assembled in memory first so that the rebuild only has to happen once, which also reduces the chances of torching your file.
output_PDF = PdfFileWriter()
This creates the space in memory for the new PDF to go into (its pages will be pulled from your old PDF).
output_PDF.addPage(input_file.getPage(page_num))
This adds the page you want from your input PDF to the PDF being built in memory.
output_PDF.write(output_file)
Finally, this writes the object stored in memory to a file, building the header and cross-reference table and linking everything together all hunky-dory.
Edit: The JUST CREATED object is the empty output file opened for writing. Passing it to write() is presumably what prompts PyPDF to build the appropriate tables, link things together, and store the result in that file.
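To make the three steps more concrete, here is a toy sketch of a writer that only collects objects in memory and works out the cross-reference offsets when write() is called. ToyPdfWriter and add_object are made-up names for illustration; this is not how pyPdf is implemented, and the bytes it produces are not a complete, valid PDF.

```python
import io


class ToyPdfWriter:
    """Hypothetical stand-in for PdfFileWriter (illustration only)."""

    def __init__(self):
        # Step 1: nothing is written yet; objects only accumulate in memory.
        self._objects = []

    def add_object(self, body):
        # Step 2: remember the object; its final byte offset is still unknown.
        self._objects.append(body)

    def write(self, stream):
        # Step 3: serialise everything in one pass, recording where each
        # object starts so the cross-reference table can point at it.
        offsets = []
        position = stream.write(b"%PDF-1.4\n")
        for number, body in enumerate(self._objects, start=1):
            offsets.append(position)
            position += stream.write(b"%d 0 obj\n" % number + body + b"\nendobj\n")
        xref_start = position
        stream.write(b"xref\n0 %d\n" % (len(offsets) + 1))
        stream.write(b"0000000000 65535 f \n")
        for offset in offsets:
            stream.write(b"%010d 00000 n \n" % offset)
        stream.write(b"trailer\n<< /Size %d >>\n" % (len(offsets) + 1))
        # The pointer to the table goes at the very end of the file.
        stream.write(b"startxref\n%d\n%%%%EOF\n" % xref_start)
        return offsets


writer = ToyPdfWriter()
writer.add_object(b"<< /Type /Page >>")
writer.add_object(b"<< /Type /Font >>")
buffer = io.BytesIO()          # plays the role of the freshly opened output file
offsets = writer.write(buffer)
data = buffer.getvalue()
```

The offsets can only be known once every object has been laid out, which is one reason it makes sense to hand the writer an output stream at the very end rather than at construction time.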
--
In response to why this differs from .txt and CSV files:
When you're copying from a text or CSV file, there are no existing data structures to interpret and carry over in order to make sure that things like formatting, image placement, and form data (input sections, etc.) are preserved and recreated properly.
Most likely, it's done because PDFs aren't exactly linear - the "header" is actually at the end of the file.
If the file were written to disk every time a change was made, your computer would have to keep shuffling that data around on the disk. Instead, the module (probably) stores the information about the document in an object (PdfFileWriter), and then converts that data into your actual PDF file when you request it.
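As a rough illustration of the "header at the end" point: a PDF reader typically starts from the tail of the file, where the startxref keyword gives the byte offset of the cross-reference table. The helper name find_xref_offset and the hand-made sample bytes below are assumptions for this sketch; the sample is not a complete PDF.

```python
def find_xref_offset(pdf_bytes):
    """Return the byte offset stored after the last 'startxref' keyword."""
    marker = pdf_bytes.rfind(b"startxref")
    if marker == -1:
        raise ValueError("no startxref keyword found")
    # The offset is the first whitespace-separated token after the keyword.
    return int(pdf_bytes[marker + len(b"startxref"):].split()[0])


# Hand-made fragment for illustration only.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"startxref\n30\n%%EOF\n"
)
xref_offset = find_xref_offset(sample)
```

Following that offset lands on the xref keyword, and from there the table gives the position of every other object, so nothing near the top of the file has to be read first.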
Related
I'm trying to modify a pdf found here
https://www.uscis.gov/sites/default/files/document/forms/i-130.pdf
It has a barcode at the bottom that I need to keep. However if I use the update_page_form_field_values() function the pdf doesn't show the barcode. How do I prevent this?
from PyPDF2 import PdfFileReader,PdfFileWriter
reader = PdfFileReader("i-130.pdf")
pager = reader.getPage(0)
field_dict = {
    "Pt2Line4b_GivenName[0]": "Mark"
}
writer = PdfFileWriter()
writer.add_page(pager)
writer.update_page_form_field_values(pager, fields=field_dict)
with open("newfile.pdf", "wb") as fh:
    writer.write(fh)
I've also tried modifying the basic fields by accessing the form annotations directly, but I have issues getting all the form fields to show up. Run this code snippet separately; it shows that you can't update the GivenName field directly:
from PyPDF2 import PdfFileReader,PdfFileWriter
from PyPDF2.generic import NameObject, IndirectObject, BooleanObject
reader = PdfFileReader("i-130.pdf")
pager = reader.getPage(0)
annot3 = pager['/Annots'][18].get_object()
annot3.update({NameObject("/V"):NameObject("Mark")})
writer = PdfFileWriter()
writer._root_object.update({NameObject("/AcroForm"):IndirectObject(len(writer._objects),0,writer)})
need_appearances = NameObject("/NeedAppearances")
writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
with open("newfile.pdf", "wb") as fh:
    writer.write(fh)
Unfortunately, I can't really give you any details on why this is so and what negative effects might result from approaching it this way, BUT...
The issue is driven by the self.set_need_appearances_writer() call at the top of the update_page_form_field_values function definition. To test, I commented that line out in my PyPDF2 distribution (see how-to below, but be advised, this is NOT a good long term solution), and the following code:
from PyPDF2 import PdfFileReader, PdfFileWriter
rdr = PdfFileReader("i-130.pdf")
fields_dict = {
    "Pt2Line4b_GivenName[0]": "Mark"
}
wtr = PdfFileWriter()
wtr.append_pages_from_reader(rdr)
# wtr.set_need_appearances_writer()
wtr.update_page_form_field_values(wtr.pages[0], fields_dict)
with open("i-130-prefilled.pdf", "wb") as ofile:
    wtr.write(ofile)
produces a prefilled PDF in which the barcode still shows.
The bad news is that the resultant PDF is quite a bit smaller than the original (~150 KB smaller), so even if it looks like an identical copy with a few fields filled out, there's clearly something more that's lacking. One of those "things" is the font in which the field values are rendered. The original version rendered entered values in a fixed-width font like Courier New, but the modified version displays a standard "sans" font. Alternatively, you can replace the append_pages_from_reader call with clone_document_from_reader, in which case you'll end up with a prepopulated version that DOES preserve the font, but the file size is ~300 KB larger than the original. No matter how you slice it, something is "not the same".
Bottom line:
I'm not sure I'd go with any solution without understanding what happens "under the hood" with that set_need_appearances_writer call. As such, I'd advise direct engagement with the good folks who picked up the baton and started actively working on the PyPDF2 library again earlier this year. You can find them at https://github.com/py-pdf/PyPDF2.
As promised, here's how you can test the solution in your own environment (but please don't do this and call it a day ;)):
1) Navigate to the dist folder for your PyPDF2 install and open _writer.py (in VSCode, you can right-click on the call to update_page_form_field_values and select 'Go to Definition').
2) Comment out the offending line, i.e. the self.set_need_appearances_writer() call at the top of update_page_form_field_values.
I wrote some simple Python code that takes a PDF, goes over its pages using PyPDF2, and saves each page as a new PDF file.
See the page save function here:
from PyPDF2 import PdfReader, PdfWriter
def save_pdf_page(file_name, page_index):
    reader = PdfReader(file_name)
    writer = PdfWriter()
    writer.add_page(reader.pages[page_index])
    writer.remove_links()
    with open(f"output_page{page_index}.pdf", "wb") as fh:
        writer.write(fh)
Surprisingly, each page is about the same size as the original PDF file.
Using removeLinks (taken from here) didn't reduce the page size.
I found a similar question here, saying it may be because PyPDF output files are uncompressed.
Is there a way, using PyPDF or any other Python library, to make each page relatively small, as expected?
You are running into this issue: https://github.com/py-pdf/PyPDF2/issues/449
Essentially there are two problems:
1) Every page might need a resource which is shared, e.g. font information.
2) PyPDF2 might not realize if some pages don't need it.
Removing links might help. Additionally, you might want to follow the docs to reduce file size:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("test.pdf")
writer = PdfWriter()
for page_num in [2, 3]:
    page = reader.pages[page_num]
    # This is CPU intensive! It ZIPs the contents of the page
    page.compress_content_streams()
    writer.add_page(page)
with open("seperate.pdf", "wb") as fh:
    writer.remove_links()
    writer.write(fh)
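To get a feel for what "ZIPs the contents" means: assuming it refers to Flate (zlib/deflate) compression of the page's content stream, the effect can be sketched with the standard library alone. The content_stream below is a made-up, highly repetitive example, so the ratio it gives says nothing about real documents.

```python
import zlib

# Invented content stream: the same text-drawing operators repeated many times.
content_stream = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 200

compressed = zlib.compress(content_stream)
restored = zlib.decompress(compressed)

# Fraction of the original size that remains after compression.
ratio = len(compressed) / len(content_stream)
```

Note that this only shrinks the page's own drawing instructions. Shared resources such as fonts, which are the first problem listed above, are a separate matter and can still make each single-page file large.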
I downloaded a pdf where every other page is blank and I'd like to remove the blank pages. I could do this manually in a pdf tool (Adobe Acrobat, Preview.app, PDFPen, etc.) but since it's several hundred pages I'd like to do something more automated. Is there a way to do this in python?
One way is to use PyPDF4, so in your terminal first run
pip install pypdf4
Then create a .py script file similar to this:
# pdf_strip_every_other_page.py
from PyPDF4 import PdfFileReader, PdfFileWriter
number_of_pages = 500
output_writer = PdfFileWriter()
with open("/path/to/original.pdf", "rb") as inputfile:
    pdfOne = PdfFileReader(inputfile)
    for i in list(range(0, number_of_pages)):
        if i % 2 == 0:
            page = pdfOne.getPage(i)
            output_writer.addPage(page)
    with open("/path/to/output.pdf", "wb") as outfile:
        output_writer.write(outfile)
Note: you'll need to change the paths to what's appropriate for your scenario.
Obviously this script is rather crude and could be improved, but I wanted to share it for anyone else wanting a quick way to deal with this scenario.
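One possible improvement, sketched here under assumptions: ask the reader for the page count instead of hard-coding 500, and step through the indices two at a time. The helper name copy_every_other_page is made up; it relies only on getNumPages(), getPage() and addPage(), which the PdfFileReader and PdfFileWriter classes used above provide.

```python
def copy_every_other_page(reader, writer, start=0):
    """Copy pages start, start+2, ... from reader into writer.

    Returns the number of pages copied. Use start=1 if the blank pages
    are the even-indexed ones rather than the odd-indexed ones.
    """
    copied = 0
    for index in range(start, reader.getNumPages(), 2):
        writer.addPage(reader.getPage(index))
        copied += 1
    return copied
```

In the script above, the for loop would be replaced by a call such as copy_every_other_page(pdfOne, output_writer), still inside the with block so the input file stays open until the output has been written.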
I am making a PDF splitter, and at first it seemed to work fine. But when I tried to use multiple page ranges, I kept getting this error: ValueError: seek of closed file.
If I omit pdf_file.close(), the error stops, but all the PDFs created have no pages.
My code is here:
from PyPDF2 import PdfFileReader , PdfFileWriter
counter = 1
pdf_file = open(fileName2,'rb')
pdf_reader = PdfFileReader(pdf_file)
pdf_writer = PdfFileWriter()
output_file2 , _ = QtWidgets.QFileDialog.getSaveFileName(self, "Save file", fileName2_c2+"_splited", "Folder will be created")
os.makedirs(r'{}'.format(output_file2+"\\{}_splited".format(fileName2_c2)))
for z in list_pdf_split:
    try:
        pdf_file = open(fileName2,'rb')
    except:
        print("error")
    print(z)
    c_z = z.split("-")
    for i in range(int(c_z[0]),int(c_z[1])+1):
        print(i)
        pdf_writer.addPage(pdf_reader.getPage(i-1))
    output_file = open(output_file2+"\\{}_splited".format(fileName2_c2)+"{}".format(counter)+".pdf",'wb')
    pdf_reader = PdfFileReader(pdf_file)
    pdf_writer = PdfFileWriter()
    pdf_writer.write(output_file)
    output_file.close()
    counter +=1
    pdf_file.close()
Your logic doesn't make much sense, in multiple places.
First, the problem you're asking about. Look at what you do with pdf_file and pdf_reader:
1) Open a file as pdf_file.
2) Create a PdfFileReader attached to pdf_file as pdf_reader.
3) Reopen the same file as pdf_file. This releases the old file object, which becomes garbage and will soon (usually immediately) get closed.
4) Repeatedly call getPage(i-1) on pdf_reader, which is probably attached to a closed file the first time, and definitely every time after that.
5) Create a new PdfFileReader with the file we opened in step 3, as pdf_reader.
6) Close the pdf_file you just opened, so pdf_reader is definitely now referencing a closed file.
7) Repeat steps 2-6.
You either need to do step 4 before step 3, or after step 5, or you need to have two different pdf_file variables so you can open the new one while still using the old one. I'm not sure which of the three you want, but as-is, you're reading from a closed file.
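The failure itself is easy to reproduce without any PDF library. In this small sketch, reader_view is a made-up name standing in for the file object that a PdfFileReader keeps a reference to; once that file is closed, any later seek fails with a ValueError.

```python
import tempfile

with tempfile.TemporaryFile() as handle:
    handle.write(b"%PDF-1.4\n")
    reader_view = handle   # stands in for the file object a reader holds on to
# Leaving the with block has closed the file.

try:
    reader_view.seek(0)    # what a lazy reader does when asked for a page
    error_message = None
except ValueError as error:
    error_message = str(error)
```

The reader does not load the pages up front; it goes back to the file whenever a page is requested, so the file has to stay open for as long as the reader is in use.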
But I think it would be simpler to reorganize things to eliminate step 1: instead of trying to open stuff before the loop and then reopening it at the end of each iteration, just open stuff at the start of the loop, right where you need it.
Second, your writer is just as confused. Look at what you do with output_file and pdf_writer:
1) Create PdfFileWriter as pdf_writer.
2) Repeatedly add pages to it.
3) Open an output file as output_file.
4) Create a new PdfFileWriter as pdf_writer, discarding everything you wrote to the old one.
5) Write out the now-empty pdf_writer to output_file.
6) Repeat steps 2-5.
Again, you need to do step 5 somewhere else, probably before step 4. But, again, it's probably a lot simpler to reorganize things to eliminate step 1.
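One way the reorganized loop could look, as a sketch rather than a drop-in fix: a fresh writer is created at the start of each iteration and written out before it is discarded, while the reader stays attached to a file that remains open. The names parse_page_range, split_by_ranges, make_writer and output_path_for are invented for this example; the reader and writer are only assumed to offer getPage(), addPage() and write() as in the question.

```python
def parse_page_range(spec):
    """Turn a 1-based, inclusive 'first-last' string into 0-based page indices."""
    first, last = (int(part) for part in spec.split("-"))
    if first < 1 or last < first:
        raise ValueError("invalid page range: %r" % spec)
    return range(first - 1, last)


def split_by_ranges(reader, specs, make_writer, output_path_for):
    """Write one output file per range, using a fresh writer each time."""
    for counter, spec in enumerate(specs, start=1):
        writer = make_writer()           # new, empty writer for this output
        for index in parse_page_range(spec):
            writer.addPage(reader.getPage(index))
        with open(output_path_for(counter), "wb") as output_file:
            writer.write(output_file)    # write before the writer is discarded
```

With the question's names, this would be called inside a single with open(fileName2, 'rb') as pdf_file: block, passing PdfFileReader(pdf_file) as reader, list_pdf_split as specs, PdfFileWriter as make_writer, and a small function that builds the output path from the counter.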
Sorry, I suppose I was too quick to ask this question.
I moved pdf_writer and pdf_reader to the beginning of the for loop, as their old position seemed to block the code (the stream for writing the PDF).
The following code has successfully split a large PDF file into smaller PDFs of 2 pages each. However, if I look into one of the files, I see meta-text from others.
This is used to split the PDF into smaller ones:
import numpy as np
from PyPDF2 import PdfFileWriter, PdfFileReader
inputpdf = PdfFileReader(open(path+"multi.pdf", "rb"))
r=np.arange(inputpdf.numPages)
r2=[(r[i],r[i+1]) for i in range(0,len(r),2)]
for i in r2:
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i[0]))
    output.addPage(inputpdf.getPage(i[1]))
    with open(path+"document-page %s.pdf" % i[0], "wb") as outputStream:
        output.write(outputStream)
This is used to get the meta-text of one of the resulting files (PyPDF2 will not read it):
import pdfx
path=path+'document-page 8.pdf'
pdf = pdfx.PDFx(path)
pdf.get_text()
My issues with this are:
1) The process is super slow and all I want is a 10 digit number in the upper-right corner of the first page. Can I somehow just get that part?
2) When looking at the result, it has text from other adjacent pages from the original PDF file (which is why I'm calling it "meta-text"). Why is that? Can this be resolved?
Update:
pdf.get_references_count()
...shows 20 (there should only be 2)
Thanks in advance!