Python PyPDF2 'PdfFileReader' object has no attribute 'scaleTo' error - python

This is my first program, so I imagine there are a lot of inefficiencies. I first created a GUI that works on a combined PDF. In converting that working code to iterate through a directory of single-page PDFs, I get an error: the line PageObj.scaleTo(11*72, 17*72) raises the error in the question title. The GUI takes user input for the variables "x" (directory), "a" (paper size), and "s" (state). The program should resize each page to the selected size, merge it with a template (not appended, but overlaid on the same page; I have heard it described as a single-page "PDF sandwich"), and then overwrite the existing file. This should happen for every PDF in the specified directory. I have tried several versions of defining my PageObj variable but can't seem to get it right.
# Variables for User input values
x = values["-pdf_dir-"]
a = values["-paper_size-"]
s = values["-state-"]
# Location to find seal templates
state = f"G:/Drafting/Kain Mincey/Allen's seals/Correctly Sized/{a}/{s}.pdf"
Seal_pdf = PdfFileReader(open(state, "rb"), strict=False)
input_pdf = glob.glob(os.path.join(x, '*.pdf'))
output_pdf = PdfFileWriter()
page_count = len(fnmatch.filter(os.listdir(x), '*.pdf'))
i = 0
if a == "11x17":
    for file in input_pdf:
        sg.OneLineProgressMeter('My Meter', i, page_count, 'And now we Wait.....')
        PageObj = PyPDF2.PdfFileReader(open(file, "rb"))
        PageObj.scaleTo(11*72, 17*72)
        PageObj.mergePage(Seal_pdf.getPage(0))
        output_pdf.addPage(PageObj)
        output_filename = f"{x[:-4]}"
        i = i + 1

PdfFileReader returns the whole file. scaleTo applies to a page. You have to fetch the page you want with getPage. –Tim Roberts Mar 28 at 21:02
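A minimal sketch of what the comment describes, assuming the legacy PyPDF2 API used in the question (getPage and scaleTo). The helper name scale_first_page and the POINTS_PER_INCH constant are hypothetical, introduced only for illustration. The function accepts any reader-like object, so it does not import PyPDF2 itself.

```python
POINTS_PER_INCH = 72  # PDF user space defaults to 72 units per inch


def scale_first_page(reader, width_in, height_in):
    """Fetch page 0 from a reader and scale that page, not the reader.

    `reader` is assumed to expose getPage(n), and the returned page is
    assumed to expose scaleTo(width, height), as in legacy PyPDF2.
    """
    page = reader.getPage(0)  # the reader is the whole file; this is one page
    page.scaleTo(width_in * POINTS_PER_INCH, height_in * POINTS_PER_INCH)
    return page


# Hypothetical usage with the question's variables:
#   PageObj = scale_first_page(PyPDF2.PdfFileReader(open(file, "rb")), 11, 17)
#   PageObj.mergePage(Seal_pdf.getPage(0))
```

Passing the size in inches keeps the 11*72 and 17*72 arithmetic from the question in one place.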

Related

pyPDF2 PdfFileWriter output returns a corrupted file

I am very new to Python. The following code takes user input from a GUI for the "x" and "a" variables. The goal is to open each .pdf in the directory, perform the modifications, and save the file over itself. Each PDF in the directory is a single page. It seems to work; however, the newly saved file is corrupted and cannot be opened.
Seal_pdf = PdfFileReader(open(state, "rb"), strict=False)
input_pdf = glob.glob(os.path.join(x, '*.pdf'))
output_pdf = PdfFileWriter()
page_count = len(fnmatch.filter(os.listdir(x), '*.pdf'))
i = 0
if a == "11x17":
    for file in input_pdf:
        sg.OneLineProgressMeter('My Meter', i, page_count, 'And now we Wait.....')
        PageObj = PyPDF2.PdfFileReader(open(file, "rb"), strict=False).getPage(0)
        PageObj.scaleTo(11*72, 17*72)
        PageObj.mergePage(Seal_pdf.getPage(0))
        output_filename = f"{file}"
        f = open(output_filename, "wb+")
        output_pdf.write(f)
        i = i + 1
Adding output_pdf.addPage(PageObj) to the loop produces an uncorrupted file; however, it causes each successive .pdf to be added to the previous ones (e.g. "pdf1" contains only "pdf1", but "pdf2" now has two pages, "pdf1" and "pdf2", and so on). I also tried changing the two lines that open and write the output file to
with open(output_filename, "wb+") as f:
    output_pdf.write(f)
with no luck. I can't figure out what I am missing to make PdfFileWriter produce a single-page, uncorrupted file for each individual PDF in the directory.
if a == "11x17":
    for file in input_pdf:
        sg.OneLineProgressMeter('My Meter', i, page_count, 'And now we Wait.....')
        PageObj = PyPDF2.PdfFileReader(open(file, "rb"), strict=False).getPage(0)
        PageObj.scaleTo(11*72, 17*72)
        PageObj.mergePage(Seal_pdf.getPage(0))
        output_pdf.addPage(PageObj)
        output_filename = f"{file}"
        f = open(output_filename, "wb+")
        output_pdf.write(f)
        i = i + 1
I finally solved this by moving output_pdf = PdfFileWriter() inside the loop, so that each file gets a fresh writer. I stumbled across that being the solution for another loop issue ("PdfFileWriter() inside loop") and thought I would try it.
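The accumulation described above can be reproduced without any PDF library. The sketch below uses a toy stand-in class, ToyWriter, which is not part of PyPDF2 and only remembers what was added to it. It illustrates why a writer created once before the loop keeps growing, while a writer created inside the loop holds a single page per file.

```python
class ToyWriter:
    """Toy stand-in for a PDF writer: it only remembers added pages."""

    def __init__(self):
        self.pages = []

    def add_page(self, page):
        self.pages.append(page)


def page_counts_with_shared_writer(files):
    """One writer for the whole loop: every output includes earlier pages."""
    writer = ToyWriter()
    counts = []
    for name in files:
        writer.add_page(name)
        counts.append(len(writer.pages))  # pages that would be written now
    return counts


def page_counts_with_fresh_writer(files):
    """A new writer per file: every output holds exactly one page."""
    counts = []
    for name in files:
        writer = ToyWriter()
        writer.add_page(name)
        counts.append(len(writer.pages))
    return counts


files = ["pdf1.pdf", "pdf2.pdf", "pdf3.pdf"]
shared_counts = page_counts_with_shared_writer(files)
fresh_counts = page_counts_with_fresh_writer(files)
```

With three input files, the shared writer would write outputs of 1, 2 and 3 pages, while the fresh writer would write three single-page outputs.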

Merging PDFs while retaining custom page numbers (aka pagelabels) and bookmarks

I'm trying to automate merging several PDF files and have two requirements: a) existing bookmarks AND b) pagelabels (custom page numbering) need to be retained.
Retaining bookmarks when merging happens by default with PyPDF2 and pdftk, but not with pdfrw.
Pagelabels are consistently not retained in PyPDF2, pdftk or pdfrw.
I am guessing, after having searched a lot, that there is no straightforward approach to doing what I want. If I'm wrong then I hope someone can point to this easy solution. But, if there is no easy solution, any tips on how to get this going in python will be much appreciated!
Some example code:
1) With PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
# the second positional argument of PdfFileReader is `strict`, not a file mode
tmp1 = PdfFileReader('file1.pdf')
tmp2 = PdfFileReader('file2.pdf')
#extracting pagelabels is easy
pl1 = tmp1.trailer['/Root']['/PageLabels']
pl2 = tmp2.trailer['/Root']['/PageLabels']
#but PdfFileWriter or PdfFileMerger does not support writing from what I understand
So I don't know how to proceed from here.
2) With pdfrw (has more promise)
from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
#read 1st file
tmp1 = PdfReader('file1')
#add the pages
writer.addpages(tmp1.pages)
#copy bookmarks to writer
writer.trailer.Root.Outlines = tmp1.Root.Outlines
#copy pagelabels to writer
writer.trailer.Root.PageLabels = tmp1.Root.PageLabels
#read second file
tmp2 = PdfReader('file2')
#append pages
writer.addpages(tmp2.pages)
# so far so good
Page numbers of bookmarks from the 2nd file need to be offset before adding them, but when reading outlines I almost always get (IndirectObject, XXX) instead of page numbers. It's unclear how to get page numbers for each label and bookmark using pdfrw, so I'm stuck again.
As mentioned in my comment, I'm posting a generic solution for merging several PDFs that works in PyPDF2. I don't know what makes this work in PyPDF2, other than initializing pls as ArrayObject().
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
import PyPDF2.pdf as PDF
# pls holds all the pagelabels as we iterate through multiple pdfs
pls = PDF.ArrayObject()
# used to offset bookmarks
pageCount = 0
cpdf = PdfFileMerger()
# pdffiles is a list of all files to be merged
for i in range(len(pdffiles)):
    tmppdf = PdfFileReader(pdffiles[i])
    cpdf.append(tmppdf)
    # copy all the pagelabels which I assume is present in all files
    # you could use 'try' in case no pagelabels are present
    plstmp = tmppdf.trailer['/Root']['/PageLabels']['/Nums']
    # sometimes keys are indirect objects
    # so, iterate through each pagelabel and...
    for j in range(len(plstmp)):
        # ... get the actual values
        plstmp[j] = plstmp[j].getObject()
        # offset pagenumbers by current count of pages
        if isinstance(plstmp[j], int):
            plstmp[j] = PDF.NumberObject(plstmp[j] + pageCount)
    # once all the pagelabels are processed I append to pls
    pls += plstmp
    # increment pageCount
    pageCount += tmppdf.getNumPages()
# rest follows KevinM's answer
pagenums = PDF.DictionaryObject()
pagenums.update({PDF.NameObject('/Nums') : pls})
pagelabels = PDF.DictionaryObject()
pagelabels.update({PDF.NameObject('/PageLabels') : pagenums})
cpdf.output._root_object.update(pagelabels)
cpdf.write("filename.pdf")
You need to iterate through the existing PageLabels and add them to the merged output, taking care to add an offset to the page index entry, based on the number of pages already added.
This solution also requires PyPDF4, since PyPDF2 produces a weird error (see bottom).
from PyPDF4 import PdfFileWriter, PdfFileMerger, PdfFileReader
# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF
import logging
def add_nums(num_entry, page_offset, nums_array):
    for num in num_entry['/Nums']:
        if isinstance(num, int):
            logging.debug("Found page number %s, offset %s: ", num, page_offset)
            # Add the physical page information
            nums_array.append(PDF.NumberObject(num+page_offset))
        else:
            # {'/S': '/r'}, or {'/S': '/D', '/St': 489}
            keys = num.keys()
            logging.debug("Found page label, keys: %s", keys)
            number_type = PDF.DictionaryObject()
            # Always copy the /S entry
            s_entry = num['/S']
            number_type.update({PDF.NameObject("/S"): PDF.NameObject(s_entry)})
            logging.debug("Adding /S entry: %s", s_entry)
            if '/St' in keys:
                # If there is an /St entry, fetch it
                pdf_label_offset = num['/St']
                # and copy it unchanged (it is the label's start value, not a page index)
                logging.debug("Found /St %s", pdf_label_offset)
                number_type.update({PDF.NameObject("/St"): PDF.NumberObject(pdf_label_offset)})
            # Add the label information
            nums_array.append(number_type)
    return nums_array

def write_merged(pdf_readers):
    # Output
    merger = PdfFileMerger()
    # For PageLabels information
    page_labels = []
    page_offset = 0
    nums_array = PDF.ArrayObject()
    # Iterate through all the inputs
    for pdf_reader in pdf_readers:
        try:
            # Merge the content
            merger.append(pdf_reader)
            # Handle the PageLabels
            # Fetch page information
            old_page_labels = pdf_reader.trailer['/Root']['/PageLabels']
            page_count = pdf_reader.getNumPages()
            # Add PageLabel information
            add_nums(old_page_labels, page_offset, nums_array)
            page_offset = page_offset + page_count
        except Exception as err:
            print("ERROR: %s" % err)
    # Add PageLabels
    page_numbers = PDF.DictionaryObject()
    page_numbers.update({PDF.NameObject("/Nums"): nums_array})
    page_labels = PDF.DictionaryObject()
    page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})
    root_obj = merger.output._root_object
    root_obj.update(page_labels)
    # Write output
    merger.write('merged.pdf')

pdf_readers = []
tmp1 = PdfFileReader('file1.pdf')
tmp2 = PdfFileReader('file2.pdf')
pdf_readers.append(tmp1)
pdf_readers.append(tmp2)
write_merged(pdf_readers)
Note: PyPDF2 produces this weird error:
...
...
File "/usr/lib/python3/dist-packages/PyPDF2/pdf.py", line 552, in _sweepIndirectReferences
data[key] = value
File "/usr/lib/python3/dist-packages/PyPDF2/generic.py", line 507, in __setitem__
raise ValueError("key must be PdfObject")
ValueError: key must be PdfObject
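The offsetting step that both answers rely on can be isolated from the PDF libraries. The sketch below works on plain Python lists and dicts shaped like a /Nums array (a page index followed by its label dictionary). The function names and the sample data are hypothetical, and nothing here touches PyPDF2 or PyPDF4 objects.

```python
def shift_page_label_nums(nums, page_offset):
    """Return a copy of a /Nums-style list with page indices shifted.

    Integer entries are page indices and get the offset added; label
    dictionaries (including any /St start value) are copied unchanged.
    """
    shifted = []
    for entry in nums:
        if isinstance(entry, int):
            shifted.append(entry + page_offset)
        else:
            shifted.append(dict(entry))
    return shifted


def merge_page_label_nums(documents):
    """Merge (nums, page_count) pairs in order, offsetting each by the
    number of pages that came before it."""
    merged = []
    offset = 0
    for nums, page_count in documents:
        merged.extend(shift_page_label_nums(nums, offset))
        offset += page_count
    return merged


# Hypothetical inputs: a 10-page file with roman then decimal labels,
# followed by a 5-page file whose decimal labels start at 489.
first = ([0, {"/S": "/r"}, 4, {"/S": "/D"}], 10)
second = ([0, {"/S": "/D", "/St": 489}], 5)
merged_nums = merge_page_label_nums([first, second])
```

In this example the second file's index 0 becomes 10 in the merged list, while its /St value of 489 is left as it was.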

How to add images with figure captions to a word document

import win32com.client as win32
import os
# Create a Word application object
wordApp = win32.gencache.EnsureDispatch('Word.Application')
wordApp.Visible = True # show the Word application window
doc = wordApp.Documents.Add() # create a new document
# Formatting the document
doc.PageSetup.RightMargin = 10
doc.PageSetup.LeftMargin = 10
doc.PageSetup.Orientation = win32.constants.wdOrientLandscape
# a4 paper size: 595x842
doc.PageSetup.PageWidth = 595
doc.PageSetup.PageHeight = 842
# Inserting Tables
my_dir="C:/Users/David/Documents/EGi/EGi Plots/FW_plots/Boxplots"
filenames = os.listdir(my_dir)
piccount=0
file_count = 0
for i in filenames:
    if i[len(i)-3: len(i)].upper() == 'JPG': # check whether the current object is a JPG file
        piccount = piccount + 1
print(piccount, " images will be inserted")
total_column = 1
total_row = int(piccount/total_column)+2
rng = doc.Range(0,0)
rng.ParagraphFormat.Alignment = win32.constants.wdAlignParagraphCenter
table = doc.Tables.Add(rng,total_row, total_column)
table.Borders.Enable = False
if total_column > 1:
    table.Columns.DistributeWidth()
#Collecting images in the same directory and inserting them into the document
piccount = 1
for index, filename in enumerate(filenames): # loop through all the files and folders for adding pictures
    if os.path.isfile(os.path.join(os.path.abspath(my_dir), filename)): # check whether the current object is a file or not
        if filename[len(filename)-3: len(filename)].upper() == 'JPG': # check whether the current object is a JPG file
            piccount = piccount + 1
            print(filename, len(filename), filename[len(filename)-3: len(filename)].upper())
            cell_column = (piccount % total_column + 1) #calculating the position of each image to be put into the correct table cell
            cell_row = (piccount // total_column + 1) # integer division, so the row index stays an int
            #print('cell_column=%s,cell_row=%s' % (cell_column,cell_row))
            #we are formatting the style of each cell
            cell_range= table.Cell(cell_row, cell_column).Range
            cell_range.ParagraphFormat.LineSpacingRule = win32.constants.wdLineSpaceSingle
            cell_range.ParagraphFormat.SpaceBefore = 0
            cell_range.ParagraphFormat.SpaceAfter = 3
            #this is where we are going to insert the images
            current_pic=cell_range.InlineShapes.AddPicture(os.path.join(os.path.abspath(my_dir), filename))
            #Currently this puts a label in a cell after the pic, I want to put a proper ms word figure caption below the image instead.
            table.Cell(cell_row, cell_column).Range.InsertAfter("\n"+"Appendix II Figure "+ str(piccount-1)+": "+filename[:len(filename)-4]+"\n"+"\n"+"\n")
        else: continue
This code gets all the images in a chosen directory and puts them in a table in a Word document, then puts the file name (stripped of its extension) in the cell below each image. I would like a proper figure caption (so that the numbering updates if I insert additional pictures), but everything I've tried has failed.
I just can't get the VB commands right. This:
table.Cell(cell_row, cell_column).Range.InsertAfter(InsertCaption(Label="Figure", Title=": "+filename[:len(filename)-4]))
gives me a list of figure captions at the end of the document, which isn't really what I want. I feel like I am close, but I can't quite get it. Thanks!
To use Word's built-in captioning, call current_pic.Range.InsertCaption instead of current_pic.InsertCaption. The InsertCaption method is a member of the Range object, not of the InlineShape object. For me, this automatically inserts the caption below the picture, in its own paragraph. If you want to specify "below" explicitly, pass the Position argument as well:
current_pic.Range.InsertCaption(Label="Figure", Title=": "+filename[:len(filename)-4], Position=win32.constants.wdCaptionPositionBelow)
Note: FWIW when I test the line of code (in VBA) that you say gives you a list of captions at the end of the document I do see the text in the same cell as the inserted picture.

PyPDF2 & ReportLab editing a PDF and merging multiple pages

I'm trying to add some text (page numbers) to an existing PDF file.
I'm using the PyPDF2 package to iterate through the original file, create a canvas for each page, and then merge the two. My problem is that once the program finishes, the new PDF contains only the last page of the original, not all the pages.
E.g. if the original PDF has 33 pages, the new PDF has only the last page, but with the correct numbering.
Maybe the code can do a better job of explaining:
def test(location, reference, destination):
    file = open(location, "rb")
    read_pdf = PyPDF2.PdfFileReader(file)
    for i in range (0, read_pdf.getNumPages()):
        page = read_pdf.getPage(i)
        pageReference = "%s_%s"%(reference,format(i+1, '03d'))
        width = getPageSizeW(page)
        height = getPageSizeH(page)
        pagesize = (width, height)
        packet = io.BytesIO()
        can = canvas.Canvas(packet, pagesize = pagesize)
        can.setFillColorRGB(1,0,0)
        can.drawString(height*3.5, height*2.75, pageReference)
        can.save()
        packet.seek(0)
        new_pdf = PyPDF2.PdfFileReader(packet)
        #add new pdf to old pdf
        output = PyPDF2.PdfFileWriter()
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
        outputStream = open(destination, 'wb')
        output.write(outputStream)
        print(pageReference)
    outputStream.close()
    file.close()

def getPageSizeH(p):
    h = float(p.mediaBox.getHeight()) * 0.352
    return h

def getPageSizeW(p):
    w = float(p.mediaBox.getWidth()) * 0.352
    return w
Also if anyone has any ideas on how to insert the references on the top right in a better way, it would be appreciated.
I'm not an expert in PyPDF2, but the only place your function calls PyPDF2.PdfFileWriter() is inside the for loop. I suspect you are therefore starting a new output on every iteration and adding a single page to it, which would explain why only the last page survives.
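A minimal sketch of the restructuring this answer points at, assuming the legacy PyPDF2 method names used in the question (getNumPages, getPage, addPage). The helper add_all_pages and its optional stamp callback are hypothetical. The function takes an already-created writer, which forces the caller to build it once, outside any per-page loop.

```python
def add_all_pages(reader, writer, stamp=None):
    """Add every page of `reader` to the single `writer` passed in.

    `stamp`, if given, is called as stamp(page, index) before the page is
    added, which is where a per-page overlay could be merged.
    """
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        if stamp is not None:
            stamp(page, index)
        writer.addPage(page)
    return writer


# Hypothetical usage with the question's names:
#   output = PyPDF2.PdfFileWriter()           # created once, before the loop
#   add_all_pages(read_pdf, output)
#   with open(destination, "wb") as outputStream:
#       output.write(outputStream)            # written once, after the loop
```

Creating the writer before the loop and writing after it means all pages end up in one output file.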

How to "write to variable" instead of "to file" in Python

I'm trying to write a function which splits a PDF into separate pages. From this SO answer, I copied a simple function that does this:
def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        with open("document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
    return pages
This however, writes the new PDFs to file, instead of returning a list of the new PDFs as file variables. So I changed the line of output.write(outputStream) to:
pages.append(outputStream)
When trying to write the elements in the pages list however, I get a ValueError: I/O operation on closed file.
Does anybody know how I can add the new files to the list and return them, instead of writing them to file? All tips are welcome!
It is not completely clear what you mean by "list of PDFs as file variables". If you want to create strings instead of files with PDF contents, and return a list of such strings, replace open() with StringIO and call getvalue() to obtain the contents:
import cStringIO
def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        io = cStringIO.StringIO()
        output.write(io)
        pages.append(io.getvalue())
    return pages
You can use the in-memory binary streams in the io module. This will store the pdf files in your memory.
import io
def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        outputStream = io.BytesIO()
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        output.write(outputStream)
        # Move the stream position to the beginning,
        # making it easier for other code to read
        outputStream.seek(0)
        pages.append(outputStream)
    return pages
To later write the objects to a file, use shutil.copyfileobj:
import shutil
with open('page0.pdf', 'wb') as out:
    shutil.copyfileobj(pages[0], out)
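For context, the ValueError in the question most likely comes from the with block: it closes the file on exit, so any handle appended to the list inside it is already closed by the time the list is used. The sketch below reproduces that with the standard library only, then shows the in-memory alternative. The byte string is a stand-in for real PDF data, and the file names are hypothetical.

```python
import io
import os
import shutil
import tempfile

STAND_IN = b"%PDF-stand-in"  # placeholder bytes, not a real PDF


def collect_file_handles(directory):
    """Append the handle inside the with block; it is closed on exit."""
    handles = []
    with open(os.path.join(directory, "page0.bin"), "wb") as handle:
        handle.write(STAND_IN)
        handles.append(handle)
    return handles


def collect_memory_streams():
    """Keep the data in an in-memory stream that stays open."""
    stream = io.BytesIO()
    stream.write(STAND_IN)
    stream.seek(0)  # rewind so later readers start at the beginning
    return [stream]


with tempfile.TemporaryDirectory() as directory:
    handles = collect_file_handles(directory)
    write_failed = False
    try:
        handles[0].write(b"more")  # the handle was closed by the with block
    except ValueError:
        write_failed = True

    streams = collect_memory_streams()
    copy_path = os.path.join(directory, "copy.bin")
    with open(copy_path, "wb") as out:
        shutil.copyfileobj(streams[0], out)
    with open(copy_path, "rb") as check:
        copied = check.read()
```

The in-memory stream can still be read and copied to disk later, whereas the file handle cannot be used after its with block has ended.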
Haven't used PdfFileWriter, but think that this should work.
def splitPdf(file_):
    pdf = PdfFileReader(file_)
    pages = []
    for i in range(pdf.getNumPages()):
        output = PdfFileWriter()
        output.addPage(pdf.getPage(i))
        pages.append(output)
    return pages

def writePdf(pages):
    i = 1
    for p in pages:
        with open("document-page%s.pdf" % i, "wb") as outputStream:
            p.write(outputStream)
        i += 1
