I'm using PyPDF2 to alter a PDF document (adding bookmarks). So I need to read in the entire source PDF, and write it out, keeping as much of the data intact as possible. Merely writing each page into a new PDF object may not be sufficient to preserve document metadata.
PdfFileWriter() does have a number of methods for copying an entire file: cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot. However, they all have problems.
If I use cloneDocumentFromReader or appendPagesFromReader, I get a valid PDF file, with the correct number of pages, but all pages are blank.
If I use cloneReaderDocumentRoot, I get a minimal valid PDF file, but with no pages or data.
This has been asked before, but with no successful answers.
Other questions have asked about Blank pages in PyPDF2, but I can't apply the answer given.
Here's my code:
def bookmark(incomingFile):
    reader = PdfFileReader(incomingFile)
    writer = PdfFileWriter()

    writer.appendPagesFromReader(reader)
    #writer.cloneDocumentFromReader(reader)

    my_table_of_contents = [
        ('Page 1', 0),
        ('Page 2', 1),
        ('Page 3', 2)
    ]

    # writer.addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit')
    for title, pagenum in my_table_of_contents:
        writer.addBookmark(title, pagenum, parent=None)

    writer.setPageMode("/UseOutlines")

    with open(incomingFile, "wb") as fp:
        writer.write(fp)
I tend to get errors when PyPDF2 tries to add a bookmark to the PdfFileWriter object, because the writer doesn't actually contain any pages (or something similar).
I also wrestled with this a lot, and finally found that PyPDF2 has this issue.
Basically, I copied this answer's code into C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py (the path will depend on your distribution) around line 382, replacing the cloneDocumentFromReader function.
After that I was able to append the reader's pages to the writer with writer.cloneDocumentFromReader(pdf) and, in my case, to update the PDF metadata (Subject, Keywords, etc.).
Hope this helps. The replacement cloneDocumentFromReader looks like this:
def cloneDocumentFromReader(self, reader, after_page_append=None):
    '''
    Create a copy (clone) of a document from a PDF file reader

    :param reader: PDF file reader instance from which the clone
        should be created.
    :callback after_page_append (function): Callback function that is invoked after
        each page is appended to the writer. Signature includes a reference to the
        appended page (delegates to appendPagesFromReader). Callback signature:

        :param writer_pageref (PDF page reference): Reference to the page just
            appended to the document.
    '''
    debug = False
    if debug:
        print("Number of Objects: %d" % len(self._objects))
        for obj in self._objects:
            print("\tObject is %r" % obj)
            if hasattr(obj, "indirectRef") and obj.indirectRef != None:
                print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

    # Variables used for after cloning the root to
    # improve pre- and post- cloning experience
    mustAddTogether = False
    newInfoRef = self._info
    oldPagesRef = self._pages
    oldPages = self.getObject(self._pages)

    # If there have already been any number of pages added
    if oldPages[NameObject("/Count")] > 0:
        # Keep them
        mustAddTogether = True
    else:
        # Throw the page object out
        if oldPages in self._objects:
            newInfoRef = self._pages
            self._objects.remove(oldPages)

    # Clone the reader's root document
    self.cloneReaderDocumentRoot(reader)
    if not self._root:
        self._root = self._addObject(self._root_object)

    # Sweep for all indirect references
    externalReferenceMap = {}
    self.stack = []
    newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

    # Delete the stack to reset
    del self.stack

    # Clean-Up Time!!!
    # Get the new root of the PDF
    realRoot = self.getObject(newRootRef)

    # Get the new pages tree root and its ID Number
    tmpPages = realRoot[NameObject("/Pages")]
    newIdNumForPages = 1 + self._objects.index(tmpPages)

    # Make an IndirectObject just for the new Pages
    self._pages = IndirectObject(newIdNumForPages, 0, self)

    # If there are any pages to add back in
    if mustAddTogether:
        # Set the new page's root's parent to the old
        # page's root's reference
        tmpPages[NameObject("/Parent")] = oldPagesRef

        # Add the reference to the new page's root in
        # the old page's kids array
        newPagesRef = self._pages
        oldPages[NameObject("/Kids")].append(newPagesRef)

        # Set all references to the root of the old/new
        # page's root
        self._pages = oldPagesRef
        realRoot[NameObject("/Pages")] = oldPagesRef

        # Update the count attribute of the page's root
        oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])
    else:
        # Bump up the info's reference b/c the old
        # page's tree was bumped off
        self._info = newInfoRef
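If you would rather not edit the installed pdf.py, the same fix can be applied at runtime by assigning the patched function onto the writer class. A minimal sketch, assuming you save the function above as cloneDocumentFromReader in your own module (incomingFile and outgoingFile are placeholder paths):

from PyPDF2 import PdfFileReader, PdfFileWriter
# the pasted function needs these names available at module level
from PyPDF2.generic import IndirectObject, NameObject, NumberObject

# monkey-patch the writer with the fixed implementation defined above
PdfFileWriter.cloneDocumentFromReader = cloneDocumentFromReader

reader = PdfFileReader(incomingFile)
writer = PdfFileWriter()
writer.cloneDocumentFromReader(reader)

# metadata updates then work on the cloned document as usual
writer.addMetadata({'/Subject': 'My subject', '/Keywords': 'bookmarks, pypdf2'})

with open(outgoingFile, "wb") as fp:
    writer.write(fp)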
I'm building a custom sphinx extension and Directive to render interactive charts on sphinx and ReadTheDocs. The actual data for a chart resides in a .json file.
In a .rst document, including a chart looks like this:
.. chart:: charts/test.json
    :width: 400px
    :height: 250px

    This is the caption of the chart, yay...
The steps I take are:
1. Render the documentation with a placeholder the same size as the chart (a loading spinner)
2. Use jQuery (in an added javascript file) to get the URI (which I add as an attribute to the dom node in my Directive) of the json file (in this case charts/test.json)
3. Use jQuery to fetch the file and parse the JSON
4. On successful fetch of the data, use the plotly library to render it into a chart and use jQuery to remove the placeholder
The directive looks like:
class PlotlyChartDirective(Directive):
    """ Top-level plotly chart directive """

    has_content = True

    def px_value(argument):
        # This is not callable as self.align. We cannot make it a
        # staticmethod because we're saving an unbound method in
        # option_spec below.
        return directives.length_or_percentage_or_unitless(argument, 'px')

    required_arguments = 1
    optional_arguments = 0
    option_spec = {
        # TODO allow static images for PDF renders 'altimage': directives.unchanged,
        'height': px_value,
        'width': px_value,
    }

    def run(self):
        """ Parse a plotly chart directive """
        self.assert_has_content()
        env = self.state.document.settings.env

        # Ensure the current chart ID is initialised in the environment
        if 'next_plotly_chart_id' not in env.temp_data:
            env.temp_data['next_plotly_chart_id'] = 0
        id = env.temp_data['next_plotly_chart_id']

        # Handle the URI of the *.json asset
        uri = directives.uri(self.arguments[0])

        # Create the main node container and store the URI of the file which will be collected later
        node = nodes.container()
        node['classes'] = ['sphinx-plotly']

        # Increment the ID counter ready for the next chart
        env.temp_data['next_plotly_chart_id'] += 1

        # Only if its a supported builder do we proceed (otherwise return an empty node)
        if env.app.builder.name in get_compatible_builders(env.app):
            chart_node = nodes.container()
            chart_node['classes'] = ['sphinx-plotly-chart', f"sphinx-plotly-chart-id-{id}", f"sphinx-plotly-chart-uri-{uri}"]

            placeholder_node = nodes.container()
            placeholder_node['classes'] = ['sphinx-plotly-placeholder', f"sphinx-plotly-placeholder-{id}"]
            placeholder_node += nodes.caption('', 'Loading...')

            node += chart_node
            node += placeholder_node

            # Add optional chart caption and legend (inspired by Figure directive)
            if self.content:
                caption_node = nodes.Element()  # Anonymous container for parsing
                self.state.nested_parse(self.content, self.content_offset, caption_node)
                first_node = caption_node[0]
                if isinstance(first_node, nodes.paragraph):
                    caption = nodes.caption(first_node.rawsource, '', *first_node.children)
                    caption.source = first_node.source
                    caption.line = first_node.line
                    node += caption
                elif not (isinstance(first_node, nodes.comment) and len(first_node) == 0):
                    error = self.state_machine.reporter.error(
                        'Chart caption must be a paragraph or empty comment.',
                        nodes.literal_block(self.block_text, self.block_text),
                        line=self.lineno)
                    return [node, error]
                if len(caption_node) > 1:
                    node += nodes.legend('', *caption_node[1:])

        return [node]
No matter where I look in the source code of the Figure and Image directives (on which I'm basing this), I can't figure out how to copy the actual asset from its input location into the static folder in the build directory.
Without copying the *.json file specified in the directive's argument, I always get a "file not found" error.
I've tried hard to find supporting methods (I assumed the Sphinx app instance would have an add_static_file() method, just as it has an add_css_file() method).
I've also tried adding a list of files to copy to the app instance and then copying the assets at the end of the build, but I was thwarted because you can't add attributes to the Sphinx class.
QUESTION IN A NUTSHELL
In a custom directive, how do you copy an asset (whose path is specified by an argument to the directive) to the build's _static directory?
I realised it was straightforward to use Sphinx's copyfile utility (which only copies a file if it has changed, so it's quick) right inside the directive's run() method.
See how I get src_uri and build_uri and copy the file directly in this updated directive:
class PlotlyChartDirective(Directive):
    """ Top-level plotly chart directive """

    has_content = True

    def px_value(argument):
        # This is not callable as self.align. We cannot make it a
        # staticmethod because we're saving an unbound method in
        # option_spec below.
        return directives.length_or_percentage_or_unitless(argument, 'px')

    required_arguments = 1
    optional_arguments = 0
    option_spec = {
        # TODO allow static images for PDF renders 'altimage': directives.unchanged,
        'height': px_value,
        'width': px_value,
    }

    def run(self):
        """ Parse a plotly chart directive """
        self.assert_has_content()
        env = self.state.document.settings.env

        # Ensure the current chart ID is initialised in the environment
        if 'next_plotly_chart_id' not in env.temp_data:
            env.temp_data['next_plotly_chart_id'] = 0

        # Get the ID of this chart
        id = env.temp_data['next_plotly_chart_id']

        # Handle the src and destination URI of the *.json asset
        uri = directives.uri(self.arguments[0])
        src_uri = os.path.join(env.app.builder.srcdir, uri)
        build_uri = os.path.join(env.app.builder.outdir, '_static', uri)

        # Create the main node container and store the URI of the file which will be collected later
        node = nodes.container()
        node['classes'] = ['sphinx-plotly']

        # Increment the ID counter ready for the next chart
        env.temp_data['next_plotly_chart_id'] += 1

        # Only if its a supported builder do we proceed (otherwise return an empty node)
        if env.app.builder.name in get_compatible_builders(env.app):

            # Make the directories and copy file (if file has changed)
            destdir = os.path.dirname(build_uri)
            if not os.path.exists(destdir):
                os.makedirs(destdir)
            copyfile(src_uri, build_uri)

            width = self.options.pop('width', DEFAULT_WIDTH)
            height = self.options.pop('height', DEFAULT_HEIGHT)

            chart_node = nodes.container()
            chart_node['classes'] = ['sphinx-plotly-chart', f"sphinx-plotly-chart-id-{id}", f"sphinx-plotly-chart-uri-{uri}"]

            placeholder_node = nodes.container()
            placeholder_node['classes'] = ['sphinx-plotly-placeholder', f"sphinx-plotly-placeholder-{id}"]
            placeholder_node += nodes.caption('', 'Loading...')

            node += chart_node
            node += placeholder_node

            # Add optional chart caption and legend (as per figure directive)
            if self.content:
                caption_node = nodes.Element()  # Anonymous container for parsing
                self.state.nested_parse(self.content, self.content_offset, caption_node)
                first_node = caption_node[0]
                if isinstance(first_node, nodes.paragraph):
                    caption = nodes.caption(first_node.rawsource, '', *first_node.children)
                    caption.source = first_node.source
                    caption.line = first_node.line
                    node += caption
                elif not (isinstance(first_node, nodes.comment) and len(first_node) == 0):
                    error = self.state_machine.reporter.error(
                        'Chart caption must be a paragraph or empty comment.',
                        nodes.literal_block(self.block_text, self.block_text),
                        line=self.lineno)
                    return [node, error]
                if len(caption_node) > 1:
                    node += nodes.legend('', *caption_node[1:])

        return [node]
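For completeness, a directive like this still needs to be registered in the extension's setup() function. A minimal sketch with assumed names: get_compatible_builders, DEFAULT_WIDTH and DEFAULT_HEIGHT stand in for the helpers referenced above, and the JS/CSS filenames are placeholders for the extension's own assets.

import os

from docutils import nodes
from docutils.parsers.rst import Directive, directives
from sphinx.util.osutil import copyfile  # only copies when the source has changed

DEFAULT_WIDTH = '100%'    # assumed defaults for the chart size
DEFAULT_HEIGHT = '400px'

def get_compatible_builders(app):
    # builders that can run the client-side javascript
    return ['html', 'dirhtml', 'readthedocs']

def setup(app):
    app.add_css_file('sphinx-plotly.css')  # placeholder asset names
    app.add_js_file('sphinx-plotly.js')
    app.add_directive('chart', PlotlyChartDirective)
    return {'version': '0.1', 'parallel_read_safe': True}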
Is it possible in Python to pretty-print the root element's attributes?
I used etree to extend the attributes of the child tags and then overwrote the existing file with the new content. However, when the XML was first generated we used a template in which the attributes of the root tag were listed one per line, and with etree I can't achieve the same result.
I found similar questions, but they all referred to the etree tutorial, which I find incomplete.
Hopefully someone has found a solution for this using etree.
EDIT: This is custom XML, so HTML Tidy (which was proposed in the comments) doesn't work here.
Thanks!
generated_descriptors = list_generated_files(generated_descriptors_folder)
counter = 0

for g in generated_descriptors:
    if counter % 20 == 0:
        print "Extending Descriptor # %s out of %s" % (counter, len(descriptor_attributes))

    with open(generated_descriptors_folder + "\\" + g, 'r+b') as descriptor:
        root = etree.XML(descriptor.read(), parser=parser)

        # Go through every ContextObject to check if the block is mandatory
        for context_object in root.findall('ContextObject'):
            for attribs in descriptor_attributes:
                if attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'false')
                elif attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] not in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'true')

        # Sort the ContextObjects based on allow-null and their name
        context_objects = root.findall('ContextObject')
        context_objects_sorted = sorted(context_objects, key=lambda c: (c.attrib['allow-null'], c.attrib['name']))
        root[:] = context_objects_sorted

        # Remove mandatoryobjects from Descriptor attributes and pretty print
        root.attrib.pop("mandatoryobjects", None)
        # paste new line here

        # Convert to string in order to write the enhanced descriptor
        xml = etree.tostring(root, pretty_print=True, encoding="UTF-8", xml_declaration=True)

        # Write the enhanced descriptor
        descriptor.seek(0)      # Set cursor at beginning of the file
        descriptor.truncate(0)  # Make sure that file is empty
        descriptor.write(xml)
        descriptor.close()

    counter += 1
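One possible workaround (lxml itself has no serialization option that puts attributes one per line) is to post-process the serialized bytes and rebuild just the opening root tag. A minimal sketch, assuming the root element carries no namespace; the helper name is made up, and you would call it on the xml variable right before descriptor.write(xml):

from xml.sax.saxutils import quoteattr

def root_attrs_one_per_line(xml_bytes, root):
    """Rewrite the opening root tag so each attribute sits on its own line."""
    text = xml_bytes.decode("utf-8")
    start = text.index("<" + root.tag)          # start of the opening root tag
    end = text.index(">", start)                # end of the opening root tag
    indent = "\n" + " " * (len(root.tag) + 2)   # line attributes up under the tag name
    attrs = indent.join("%s=%s" % (k, quoteattr(v)) for k, v in root.attrib.items())
    rebuilt = "<%s%s%s" % (root.tag, indent if attrs else "", attrs)
    return (text[:start] + rebuilt + text[end:]).encode("utf-8")

# e.g. at the "# paste new line here" spot:
# xml = root_attrs_one_per_line(xml, root)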
Consider a reStructuredText document with this skeleton:
Main Title
==========
text text text text text
Subsection
----------
text text text text text
.. my-import-from:: file1
.. my-import-from:: file2
The my-import-from directive is provided by a document-specific Sphinx extension, which is supposed to read the file provided as its argument, parse reST embedded in it, and inject the result as a section in the current input file. (Like autodoc, but for a different file format.) The code I have for that, right now, looks like this:
class MyImportFromDirective(Directive):
    required_arguments = 1

    def run(self):
        src, srcline = self.state_machine.get_source_and_line()
        doc_file = os.path.normpath(os.path.join(os.path.dirname(src),
                                                 self.arguments[0]))
        self.state.document.settings.record_dependencies.add(doc_file)

        doc_text = ViewList()
        try:
            doc_text = extract_doc_from_file(doc_file)
        except EnvironmentError as e:
            raise self.error(e.filename + ": " + e.strerror) from e

        doc_section = nodes.section()
        doc_section.document = self.state.document

        # report line numbers within the nested parse correctly
        old_reporter = self.state.memo.reporter
        self.state.memo.reporter = AutodocReporter(doc_text,
                                                   self.state.memo.reporter)
        nested_parse_with_titles(self.state, doc_text, doc_section)
        self.state.memo.reporter = old_reporter

        if len(doc_section) == 1 and isinstance(doc_section[0], nodes.section):
            doc_section = doc_section[0]

        # If there was no title, synthesize one from the name of the file.
        if len(doc_section) == 0 or not isinstance(doc_section[0], nodes.title):
            doc_title = nodes.title()
            doc_title.append(make_title_text(doc_file))
            doc_section.insert(0, doc_title)

        return [doc_section]
This works, except that the new section is injected as a child of the current section, rather than a sibling. In other words, the example document above produces a TOC tree like this:
Main Title
    Subsection
        File1
        File2
instead of the desired
Main Title
    Subsection
    File1
    File2
How do I fix this? The Docutils documentation is ... inadequate, particularly regarding control of section depth. One obvious thing I have tried is returning doc_section.children instead of [doc_section]; that completely removes File1 and File2 from the TOC tree (but does make the section headers in the body of the document appear to be for the right nesting level).
I don't think it is possible to do this by returning the section from the directive (without doing something along the lines of what Florian suggested), as it will get appended to the 'current' section. You can, however, add the section via self.state.section, as I do in the following (handling of options removed for brevity):
class FauxHeading(object):
    """
    A heading level that is not defined by a string. We need this to work with
    the mechanics of
    :py:meth:`docutils.parsers.rst.states.RSTState.check_subsection`.

    The important thing is that the length can vary, but it must be equal to
    any other instance of FauxHeading.
    """

    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __eq__(self, other):
        return isinstance(other, FauxHeading)


class ParmDirective(Directive):

    required_arguments = 1
    optional_arguments = 0
    has_content = True
    option_spec = {
        'type': directives.unchanged,
        'precision': directives.nonnegative_int,
        'scale': directives.nonnegative_int,
        'length': directives.nonnegative_int}

    def run(self):
        variableName = self.arguments[0]
        lineno = self.state_machine.abs_line_number()
        secBody = None
        block_length = 0

        # added for some space
        lineBlock = nodes.line('', '', nodes.line_block())

        # parse the body of the directive
        if self.has_content and len(self.content):
            secBody = nodes.container()
            block_length += nested_parse_with_titles(
                self.state, self.content, secBody)

        # keeping track of the level seems to be required if we want to allow
        # nested content. Not sure why, but fits with the pattern in
        # :py:meth:`docutils.parsers.rst.states.RSTState.new_subsection`
        myLevel = self.state.memo.section_level
        self.state.section(
            variableName,
            '',
            FauxHeading(2 + len(self.options) + block_length),
            lineno,
            [lineBlock] if secBody is None else [lineBlock, secBody])
        self.state.memo.section_level = myLevel
        return []
I don't know how to do it directly inside your custom directive. However, you can use a custom transform to raise the File1 and File2 nodes in the tree after parsing. For example, see the transforms in the docutils.transforms.frontmatter module.
In your Sphinx extension, use the Sphinx.add_transform method to register the custom transform.
Update: You can also directly register the transform in your directive by returning one or more instances of the docutils.nodes.pending class in your node list. Make sure to call the note_pending method of the document in that case (in your directive you can get the document via self.state_machine.document).
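To make the transform route concrete, here is a rough, untested sketch. The 'imported-section' marker class is an assumption (the directive would need to add it to the classes of the sections it returns), and the reparenting logic is only one way to do it:

from docutils import nodes
from docutils.transforms import Transform

class PromoteImportedSections(Transform):
    """Move sections tagged with the 'imported-section' class up one level,
    turning them into siblings of the section they were parsed under."""
    default_priority = 700  # run after parsing, before writing

    def apply(self):
        for section in list(self.document.traverse(nodes.section)):
            if 'imported-section' not in section.get('classes', []):
                continue
            parent = section.parent
            if not isinstance(parent, nodes.section) or parent.parent is None:
                continue  # already at the level we want
            grandparent = parent.parent
            # detach from the current parent and re-insert right after it
            parent.remove(section)
            grandparent.insert(grandparent.index(parent) + 1, section)

def setup(app):
    app.add_transform(PromoteImportedSections)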
def analysis_report(request):
    response = HttpResponse(mimetype='application/pdf')
    response['Content-Disposition'] = 'attachment;filename=ANALYSIS_REPORT.pdf'

    buffer = StringIO()
    doc = SimpleDocTemplate(buffer)
    doc.sample_no = 12345

    document = []
    doc.build(document, onLaterPages=header_footer)


def header_footer(canvas, doc):
    canvas.saveState()
    canvas.setFont("Times-Bold", 11)
    canvas.setFillColor(gray)
    canvas.setStrokeColor('#5B80B2')
    canvas.drawCentredString(310, 800, 'HEADER ONE GOES HERE')
    canvas.drawString(440, 780, 'Sample No: %s' % doc.sample_no)
    canvas.setFont('Times-Roman', 5)
    canvas.drawString(565, 4, "Page %d" % doc.page)
In the above code I am able to display the page number, but my question is: how can I display "Page X of Y", where Y is the total page count and X is the current page?
I followed this http://code.activestate.com/recipes/546511-page-x-of-y-with-reportlab/, but it explains the approach using canvasmaker, whereas I'm using the onLaterPages argument of build().
How can I achieve this using canvasmaker, or is there a solution using onLaterPages?
Here is the improved recipe http://code.activestate.com/recipes/576832/ which should work with images.
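The core of that recipe is a Canvas subclass that buffers each page's state and stamps "Page X of Y" once the total is known; you pass it to build() via the canvasmaker argument, so it combines cleanly with your onLaterPages callback. A condensed sketch (the font and coordinates are copied from your header_footer and may need adjusting):

from reportlab.pdfgen import canvas

class NumberedCanvas(canvas.Canvas):
    def __init__(self, *args, **kwargs):
        canvas.Canvas.__init__(self, *args, **kwargs)
        self._saved_page_states = []

    def showPage(self):
        # defer the real showPage until the total page count is known
        self._saved_page_states.append(dict(self.__dict__))
        self._startPage()

    def save(self):
        num_pages = len(self._saved_page_states)
        for state in self._saved_page_states:
            self.__dict__.update(state)
            self.setFont('Times-Roman', 5)
            self.drawString(565, 4, "Page %d of %d" % (self._pageNumber, num_pages))
            canvas.Canvas.showPage(self)
        canvas.Canvas.save(self)

# then build with (and drop the plain "Page %d" line from header_footer so it isn't drawn twice):
# doc.build(document, onLaterPages=header_footer, canvasmaker=NumberedCanvas)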
Another possible workaround is to use pyPdf (or any other PDF library with that functionality) to read the total number of pages after doc.build(), and then rebuild the story with that information by exchanging the corresponding Paragraph()s. This approach is more hackish, but it does the trick without subclassing.
Example:
from pyPdf import PdfFileReader
[...]
story.append(Paragraph('temp paragraph. this will be exchanged with the total page number'))
post_story = story[:]  # copy the story because build consumes it
doc.build(story)       # build the pdf with name temp.pdf
temp_pdf = PdfFileReader(file("temp.pdf", "rb"))
total_pages = temp_pdf.getNumPages()
post_story[-1] = Paragraph('total pages: ' + str(total_pages))
doc.build(post_story)
I would like to use pyPdf to split a PDF file based on the outline, where each destination in the outline refers to a different page within the PDF.
Example outline:
main --> points to page 1
sect1 --> points to page 1
sect2 --> points to page 15
sect3 --> points to page 22
It is easy within pyPdf to iterate over each page of the document, or over each destination in the document's outline; however, I cannot figure out how to get the page number that a destination points to.
Does anybody know how to find the referenced page number for each destination in the outline?
I figured it out:
class Darrell(pyPdf.PdfFileReader):

    def getDestinationPageNumbers(self):
        def _setup_outline_page_ids(outline, _result=None):
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, pyPdf.pdf.Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        outline_page_ids = _setup_outline_page_ids(self.getOutlines())
        page_id_to_page_numbers = _setup_page_id_to_num()

        result = {}
        for (_, title), page_idnum in outline_page_ids.iteritems():
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result


pdf = Darrell(open(PATH-TO-PDF, 'rb'))

template = '%-5s  %s'
print template % ('page', 'title')
for p, t in sorted([(v, k) for k, v in pdf.getDestinationPageNumbers().iteritems()]):
    print template % (p+1, t)
This is just what I was looking for. Darrell's additions to PdfFileReader should be part of PyPDF2.
I wrote a little recipe that uses PyPDF2 and sejda-console to split a PDF by bookmarks. In my case there are several Level 1 sections that I want to keep together. This script allows me to do that and give the resulting files meaningful names.
import operator
import os
import subprocess
import sys
import time

import PyPDF2 as pyPdf

# need to have sejda-console installed
# change this to point to your installation
sejda = 'C:\\sejda-console-1.0.0.M2\\bin\\sejda-console.bat'

class Darrell(pyPdf.PdfFileReader):
    ...

if __name__ == '__main__':
    t0 = time.time()

    # get the name of the file to split as a command line arg
    pdfname = sys.argv[1]

    # open up the pdf
    pdf = Darrell(open(pdfname, 'rb'))

    # build list of (pagenumbers, newFileNames)
    splitlist = [(1, 'FrontMatter')]  # Customize name of first section

    template = '%-5s  %s'
    print template % ('Page', 'Title')
    print '-'*72
    for t, p in sorted(pdf.getDestinationPageNumbers().iteritems(),
                       key=operator.itemgetter(1)):

        # Customize this to get it to split where you want
        if t.startswith('Chapter') or \
           t.startswith('Preface') or \
           t.startswith('References'):

            print template % (p+1, t)

            # this customizes how files are renamed
            new = t.replace('Chapter ', 'Chapter')\
                   .replace(': ', '-')\
                   .replace(': ', '-')\
                   .replace(' ', '_')
            splitlist.append((p+1, new))

    # call sejda tools and split document
    call = sejda
    call += ' splitbypages'
    call += ' -f "%s"' % pdfname
    call += ' -o ./'
    call += ' -n '
    call += ' '.join([str(p) for p, t in splitlist[1:]])
    print '\n', call
    subprocess.call(call)
    print '\nsejda-console has completed.\n\n'

    # rename the split files
    for p, t in splitlist:
        old = './%i_' % p + pdfname
        new = './' + t + '.pdf'
        print 'renaming "%s"\n to "%s"...' % (old, new),
        try:
            os.remove(new)
        except OSError:
            pass
        try:
            os.rename(old, new)
            print ' succeeded.\n'
        except:
            print ' failed.\n'

    print '\ndone. Splitting took %.2f seconds' % (time.time() - t0)
A small update to @darrell's class to be able to parse UTF-8 outlines, which I post as an answer because it would be hard to read as a comment.
The problem is in pyPdf.pdf.Destination.title, which may be returned in two flavors:
pyPdf.generic.TextStringObject
pyPdf.generic.ByteStringObject
so the output of the _setup_outline_page_ids() function also contains two different types for the title object, which fails with a UnicodeDecodeError if an outline title contains anything other than ASCII.
I added this code to solve the problem:
if isinstance(title, pyPdf.generic.TextStringObject):
    title = title.encode('utf-8')
Here is the whole class:
class PdfOutline(pyPdf.PdfFileReader):

    def getDestinationPageNumbers(self):
        def _setup_outline_page_ids(outline, _result=None):
            if _result is None:
                _result = {}
            for obj in outline:
                if isinstance(obj, pyPdf.pdf.Destination):
                    _result[(id(obj), obj.title)] = obj.page.idnum
                elif isinstance(obj, list):
                    _setup_outline_page_ids(obj, _result)
            return _result

        def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None):
            if _result is None:
                _result = {}
            if pages is None:
                _num_pages = []
                pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
            t = pages["/Type"]
            if t == "/Pages":
                for page in pages["/Kids"]:
                    _result[page.idnum] = len(_num_pages)
                    _setup_page_id_to_num(page.getObject(), _result, _num_pages)
            elif t == "/Page":
                _num_pages.append(1)
            return _result

        outline_page_ids = _setup_outline_page_ids(self.getOutlines())
        page_id_to_page_numbers = _setup_page_id_to_num()

        result = {}
        for (_, title), page_idnum in outline_page_ids.iteritems():
            if isinstance(title, pyPdf.generic.TextStringObject):
                title = title.encode('utf-8')
            result[title] = page_id_to_page_numbers.get(page_idnum, '???')
        return result
Darrell's class can be modified slightly to produce a multi-level table of contents for a PDF (in the manner of pdftoc in the pdftk toolkit).
My modification adds one more parameter to _setup_page_id_to_num, an integer "level" which defaults to 1. Each invocation increments the level. Instead of storing just the page number in the result, we store the pair of page number and level (a sketch of this change follows below). Appropriate modifications should be applied when using the returned result.
I am using this to implement the "PDF Hacks" browser-based page-at-a-time document viewer with a sidebar table of contents that reflects LaTeX section, subsection, etc. bookmarks. I am working on a shared system where pdftk cannot be installed but where Python is available.
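A minimal sketch of the change described above, adapted from Darrell's _setup_page_id_to_num earlier in the thread (untested; shown as the nested helper, so self refers to the reader as before):

def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None, level=1):
    if _result is None:
        _result = {}
    if pages is None:
        _num_pages = []
        pages = self.trailer["/Root"].getObject()["/Pages"].getObject()
    t = pages["/Type"]
    if t == "/Pages":
        for page in pages["/Kids"]:
            # store the nesting level alongside the page number
            _result[page.idnum] = (len(_num_pages), level)
            _setup_page_id_to_num(page.getObject(), _result, _num_pages, level + 1)
    elif t == "/Page":
        _num_pages.append(1)
    return _result

Callers then unpack (page number, level) pairs instead of plain page numbers when building the table of contents.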
A solution 10 years later, for newer Python and PyPDF2:
from PyPDF2 import PdfReader, PdfWriter

filename = "main.pdf"

with open(filename, "rb") as f:
    r = PdfReader(f)
    bookmarks = list(map(lambda x: (x.title, r.get_destination_page_number(x)), r.outline))
    print(bookmarks)

    for i, b in enumerate(bookmarks):
        begin = b[1]
        end = bookmarks[i+1][1] if i < len(bookmarks) - 1 else len(r.pages)
        # print(len(r.pages[begin:end]))
        name = b[0] + ".pdf"
        print(f"{name=}: {begin=}, {end=}")
        with open(name, "wb") as f:
            w = PdfWriter(f)
            for p in r.pages[begin:end]:
                w.add_page(p)
            w.write(f)