Create outlines/TOC for existing PDF in Python

Create outlines/TOC for existing PDF in Python - python

I'm using pyPdf to merge several PDF files into one. This works great, but I would also need to add a table of contents/outlines/bookmarks to the PDF file that is generated.
pyPdf seems to have only read support for outlines. Reportlab would allow me to create them, but the opensource version does not support loading PDF files, so that doesn't work to add outlines to an existing file.
Is there any way I can add outlines to an existing PDF using Python, or any library that would allow that?

https://github.com/yutayamamoto/pdfoutline
I made a python library just for adding an outline to an existing PDF file.

It looks like pypdf can do the job. See the add_outline_item method in the documentation.

We had a similar problem in WeasyPrint: cairo produces the PDF files but does not support bookmarks/outlines or hyperlinks. In the end we bit the bullet, read the PDF spec, and did it ourselves.
WeasyPrint’s pdf.py has a simple PDF parser and writer that can add/override PDF "objects" to an existing documents. It uses the PDF "update" mechanism and only append at the end of the file.
This module was made for internal use only but I’m open to refactoring it to make it easier to use in other projects.
However the parser takes a few shortcuts and can not parse all valid PDF files. It may need to be adapted if PyPDF’s output is not as nice as cairo’s. From the module’s docstring:
Rather than trying to parse any valid PDF, we make some assumptions
that hold for cairo in order to simplify the code:
All newlines are '\n', not '\r' or '\r\n'
Except for number 0 (which is always free) there is no "free" object.
Most white space separators are made of a single 0x20 space.
Indirect dictionary objects do not contain '>>' at the start of a line except to mark the end of the object, followed by 'endobj'. (In
other words, '>>' markers for sub-dictionaries are indented.)
The Page Tree is flat: all kids of the root page node are page objects, not page tree nodes.

Related

Extract Geometry Elements from PDF by OCG (by Layer)

So I've spent the good majority of a month on this issue. I'm looking for a way to extract geometry elements (polylines, text, arcs, etc.) from a vectorized PDF organised by the file's OCGs (Optional Content Groups), which are basically PDF layers. Using PDFminer I was able to extract geometry (LTCurves, LTTextBoxes, LTLines, etc.); using PyPDF2, I was able to view how many OCGs were in the PDF, though I was not able to access geometry associated with that OCG. There were a few hacky scripts I've seen and tried online that may have been able to solve this problem, but to no avail. I even resorted to opening the raw PDF data in a text editor and half hazardly removing parts of it to see if I could come up with some custom parsing technique to do this, but again to no avail. Adobe's PDF manual is minimal at best, so that was no help when I was attempting to create a parser. Does anyone know a solution to this.
At this point, I'm open to a solution in any language, using any OS (though I would prefer a solution using Python 3 on Windows or Linux), as long as it is open source / free.
Can anyone here help end this rabbit hole of darkness? Much appreciated!

A PDF document consists of two "types" of data. There is an object oriented "structure" to the document to divide it into pages, and carry meta data (like, for instance, there is this list of Optional Content Groups), and there is a stream oriented list of marking operators that actually "draw" content onto the page.
The fact that there are OCG's, and their names, and a bit about them is stored on object oriented content, and can be extracted by parsing the object content fairly easily. But the membership of the OCG's is NOT stored in the object structure. It can only be found by parsing the Content Stream. A group of marking operators is a member of a particular OCG group when it is preceeded by the content operator /OC /optionacontentgroupname BDC and followed by the operator EMC.
Parsing a content stream is a less than trivial task. There are many tools out there that will do this for you. I would not, myself, attempt to build such a parser from scratch. There is little value in re-writing the wheel.
The complete syntax of PDF is available from many sources. Search the web for "PDF Specification 1.7", or "ISO32000-1:2008". It is a daunting document to read, but it does supply all of the information needed to create both and object and a content parser

If your PDF is organized in OGC layers, then you can use gdal_translate command of GDAL.
Use the following command to check all available OGC layers in your PDF file:
gdalinfo "sample.pdf" -mdd LAYERS
Then, use the following to command to extract the partiular layer:
gdal_translate "sample.pdf" -of PNG sample.png --config GDAL_PDF_LAYERS "your_specific_layer_name"
More details are mentioned here.

Hey #pythonic_programmer, I am able to use this python library pdflayers to disable the default view (visible/not visible) of the layer into the new pdf file.
https://pypi.org/project/pdflayers/
Pretty much it means disable the default state of the layer
in the pdf file: https://helpx.adobe.com/acrobat/using/pdf-layers.html
Any layer not visible meaning that layer will not render to the pdf document when you process (by default).

Python -- Parsing files (docx, pdf and odt) and converting the content into my data model

I'm writing an import/export tool for importing docx, pdf, and odt files; in which a book has been written.
We already have a tool for the .epub format, and we'd like to extend the functionality beyond that, so users of the site can have more flexibility.
So far I've looked at PDFMiner and also found out that docx is just based on the openxml format, so the word/document.xml is essentially the file containing the whole thing, and I can parse it with lxml.
The question I have is: I'm hoping to parse the contents of these files, and from that content, extract things like chapter names, images (if any), and chapter text, so that I can fit the content into a data model of:
Book --> o2m --> Chapter --> o2m --> Image
Clearly, PDFMiner has a .get_outlines() function that will return the TOC for me. But it can't link any of the returned tuples (chapter numbers and titles) to the actual pages for that chapter.
Even more problematic is that with docx/odt; those are just paragraphs -- <\w:sdt> -- elements, with attrs and child elements.
I'm looking for idea(s) to extrapolate some sense of structure from these filetypes, and if need be, I can apply those ideas (2 or 3) as suggested formats for our users who wish to import a book via one of those file formats.

Textract is the best tool that i have encountered so far for parsing different file formats.
It can parse most of the file formats.
You can find the project on Github
Here is the official documentation

(Python 3 answer)
When I was looking for a tool to read .docx files, I was able to find one here: http://etienned.github.io/posts/extract-text-from-word-docx-simply/
What it does is simply get the text from a .docx file and return it as a string; separate paragraphs are still clearly separate, as there are the new lines between, but all other formatting is lost. I think this may include the loss of end- and foot-notes, but if you want the body of a text, it works great.
I have tested it on both Windows 10 and on OS X, and it has worked successfully on both. Here is what it imports:
import zipfile
try:
from xml.etree.cElementTree import XML
print("cElementTree")
except ImportError:
from xml.etree.ElementTree import XML
print("ElementTree")
EDIT:
If, in the body of the function, you replace
'word/document.xml'
with
'word/footnotes.xml'
or
'word/endnotes.xml'
you can get the footnotes and endnotes, respectively.
The markers for where they were in the text are lost, however.

Reading Font Colour Information From a PDF

I am working on a piece of software that analyses PDF files and generates HTML based on them. There are a number of things out there that already do this so I know it is possible, I have to write my own for business reasons.
I have managed to get all the text information, positions, fonts out of the PDF but I am struggling to read out the colour of the text. I am currently using PDFMiner to analyse the PDF but am beginning to think I will need to write my own PDFReader, even so, I can't figure out where in the document the Colour information for text is even kept! I have even read the PDF spec but cannot find the information I need.
I have scoured google, with no joy.
Thanks in advance!

The colour for text and other filled graphics is set using one of the g, rg or k operators in the content stream object in the PDF file, as described in section 4.5.7 Color Operators in the PDF reference manual.
The example G.3 Simple Graphics Example in the reference manual shows these operators being used to stroke and fill some shapes (but not text).
http://www.adobe.com/devnet/pdf/pdf_reference.html
When parsing a PDF file yourself you start by reading the trailer
at the end of the file which contains the file offset of the
cross reference table. This table contains the file offset of
each object in the PDF file. The objects are in a tree structure with references
to other objects. One of the objects will be
the content stream. This is described in sections 3.4 File Structure
and 3.6 Document Structure in the PDF reference manual.
It is possible to parse the PDF file yourself but this is
quite a lot of work. The content
stream may be compressed, contain references to other objects,
contain comments, etc. and you must handle all of these cases.
The PDFMiner software is already reading the content stream. Perhaps it
would be easier to extend PDFMiner to report the colour
of the text too?

Generate ODT/DOC(X) and convert to PDF, without OO.o/MS

I have a WSGI application that generates invoices and stores them as PDF.
So far I have solved similar problems with FPDF (or equivalents), generating the PDF from scratch like a GUI. Sadly this means the entire formatting logic (positioning headers, footers and content, styling) is in the application, where it really shouldn't be.
As the templates already exist in Office formats (ODT, DOC, DOCX), I would prefer to simply use those as a basis and fill in the actual content. I've found the Appy framework, which does pretty much that with annotated ODT files.
That still leaves the bigger problem open, tho: converting ODT (or DOC, or DOCX) to PDF. On a server. Running Linux. Without GUI libraries. And thus, without OO.o or MS Office.
Is this at all possible or am I better off keeping the styling in my code?
The actual content that would be filled in is actually quite restricted: a few paragraphs, some of which may be optional, a headline or two, always at the same place, and a few rows of a table. In HTML this would be trivial.
EDIT: Basically, I want a library that can generate ODT files from ODF files acting as templates and a library that can convert the result into PDF (which is probably the crux).

I don't know how to go about automatic ODT -> PDF conversion, but a simpler route might be to generate your invoices as HTML and convert them to PDF using http://www.xhtml2pdf.com/. I haven't tried the library myself, but it definitely seems promising.

You can use QTextDocument, QTextCursor and QTextDocumentWriter in PyQt4. A simple example to show how to write to an odt file:
>>>from pyqt4 import QtGui
# Create a document object
>>>doc = QtGui.QTextDocument()
# Create a cursor pointing to the beginning of the document
>>>cursor = QtGui.QTextCursor(doc)
# Insert some text
>>>cursor.insertText('Hello world')
# Create a writer to save the document
>>>writer = QtGui.QTextDocumentWriter()
>>>writer.supportedDocumentFormats()
[PyQt4.QtCore.QByteArray(b'HTML'), PyQt4.QtCore.QByteArray(b'ODF'), PyQt4.QtCore.QByteArray(b'plaintext')]
>>>odf_format = writer.supportedDocumentFormats()[1]
>>>writer.setFormat(odf_format)
>>>writer.setFileName('hello_world.odt')
>>>writer.write(doc) # Return True if successful
True
If not sure the difference between odt and odf in this case. I checked the file type and it said 'application/vnd.oasis.opendocument.text'. So I assume it is odt. You can print to a pdf file by using QPrinter.
More information at:
http://qt-project.org/doc/qt-4.8/

Formatted output in OpenOffice/Microsoft Word with Python

I am working on a project (in Python) that needs formatted, editable output. Since the end-user isn't going to be technically proficient, the output needs to be in a word processor editable format. The formatting is complex (bullet points, paragraphs, bold face, etc).
Is there a way to generate such a report using Python? I feel like there should be a way to do this using Microsoft Word/OpenOffice templates and Python, but I can't find anything advanced enough to get good formatting. Any suggestions?

A little known, and slightly evil fact: If you create an HTML file, and stick a .doc extension on it, Word will open it as a Word document, and most users will be none the wiser.
Except maybe a very technical person will say, my this is a small Word file! :)

Use the Python Docx module for this - 100% Python, tables, images, document properties, headings, paragraphs, and more.

" The formatting is complex(bullet points, paragraphs, bold face, etc), "
Use RST.
It's trivial to produce, since it's plain text.
It's trivial to edit, since it's plain text with a few extra characters to provide structural information.
It formats nicely using a bunch of tools.

I know there is an odtwriter for docutils. You could generate your output as reStructuredText and feed it to odtwriter or look into what odtwriter is using on the backend to generate the ODT and use that.
(I'd probably go with generating rst output and then hacking odtwriter to output the stuff I want (and contribute the fixes back to the project), because that's probably a whole lot easier that trying to render your stuff to ODT directly.)

I've used xlwt to create Excel documents using python, but I haven't needed to write word files yet. I've found this package, OOoPy, but I haven't used it.
Also you might want to try outputting html files and having the users open them in Word.

You can use QTextDocument, QTextCursor and QTextDocumentWriter in PyQt4. A simple example to show how to write to an odt file:
>>>from pyqt4 import QtGui
# Create a document object
>>>doc = QtGui.QTextDocument()
# Create a cursor pointing to the beginning of the document
>>>cursor = QtGui.QTextCursor(doc)
# Insert some text
>>>cursor.insertText('Hello world')
# Create a writer to save the document
>>>writer = QtGui.QTextDocumentWriter()
>>>writer.supportedDocumentFormats()
[PyQt4.QtCore.QByteArray(b'HTML'), PyQt4.QtCore.QByteArray(b'ODF'), PyQt4.QtCore.QByteArray(b'plaintext')]
>>>odf_format = writer.supportedDocumentFormats()[1]
>>>writer.setFormat(odf_format)
>>>writer.setFileName('hello_world.odt')
>>>writer.write(doc) # Return True if successful
True
QTextCursor also can insert tables, frames, blocks, images. More information at:
http://qt-project.org/doc/qt-4.8/qtextcursor.html
As a bonus, you also can print to a pdf file by using QPrinter.

I think OpenOffice has some Python bindings - you should be able to write OO macros in Python.
But I would use HTML instead - Word and OO.org are rather good at editing it and you can write it from Python easily (although Word saves a lot of mess which could complicate parsing it by your Python app).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Create outlines/TOC for existing PDF in Python - python

https://github.com/yutayamamoto/pdfoutline I made a python library just for adding an outline to an existing PDF file.

It looks like pypdf can do the job. See the add_outline_item method in the documentation.

Related

Extract Geometry Elements from PDF by OCG (by Layer)

Python -- Parsing files (docx, pdf and odt) and converting the content into my data model

Reading Font Colour Information From a PDF

Generate ODT/DOC(X) and convert to PDF, without OO.o/MS

Formatted output in OpenOffice/Microsoft Word with Python

Categories

Resources