Extract Geometry Elements from PDF by OCG (by Layer)

I've spent the better part of a month on this issue. I'm looking for a way to extract geometry elements (polylines, text, arcs, etc.) from a vector PDF, organised by the file's OCGs (Optional Content Groups), which are basically PDF layers. Using PDFMiner I was able to extract geometry (LTCurve, LTTextBox, LTLine, etc.); using PyPDF2 I could see how many OCGs were in the PDF, but I could not access the geometry associated with each OCG. I've tried a few hacky scripts found online that looked like they might solve this problem, but to no avail. I even resorted to opening the raw PDF data in a text editor and haphazardly removing parts of it to see if I could come up with some custom parsing technique, but again to no avail. Adobe's PDF manual is minimal at best, so it was no help when I was attempting to write a parser. Does anyone know a solution to this?
At this point, I'm open to a solution in any language, using any OS (though I would prefer a solution using Python 3 on Windows or Linux), as long as it is open source / free.
Can anyone here help end this rabbit hole of darkness? Much appreciated!

A PDF document consists of two "types" of data. There is an object-oriented "structure" that divides the document into pages and carries metadata (for instance, the list of Optional Content Groups), and there is a stream-oriented list of marking operators that actually "draw" content onto the page.
The fact that there are OCGs, their names, and a few of their properties are stored in the object-oriented structure and can be extracted fairly easily by parsing it. But the membership of the OCGs is NOT stored in the object structure. It can only be found by parsing the content stream. A group of marking operators is a member of a particular OCG when it is preceded by the content operator /OC /optionalcontentgroupname BDC and followed by the operator EMC.
Parsing a content stream is a less than trivial task. There are many tools out there that will do this for you. I would not, myself, attempt to build such a parser from scratch. There is little value in reinventing the wheel.
The complete syntax of PDF is available from many sources. Search the web for "PDF Specification 1.7" or "ISO 32000-1:2008". It is a daunting document to read, but it does supply all of the information needed to create both an object parser and a content parser.
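To make the /OC ... BDC ... EMC bracketing concrete, here is a minimal sketch in Python that groups the operators of an already decompressed content stream by layer. It is an illustration only, not a full content parser: the tokenizer is deliberately naive (it assumes no inline images, no nested parentheses in strings, and no nested inline dictionaries), and the function name group_by_ocg is made up for this example. Note that the name following /OC is normally a key in the page's /Resources /Properties dictionary, which in turn points at the OCG object holding the human-readable layer name, so the keys returned here are resource names rather than display names.

```python
import re
from collections import defaultdict

# Naive token pattern: inline dicts, literal strings, hex strings,
# array brackets, names, then everything else (numbers and operators).
_TOKEN = re.compile(
    rb"<<.*?>>|\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|[\[\]]"
    rb"|/[^\s/\[\]()<>]*|[^\s/\[\]()<>]+",
    re.S,
)
_NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)$")


def group_by_ocg(content: bytes) -> dict:
    """Map each /OC property name to the (operands, operator) pairs inside it.

    Operators outside any optional-content section are stored under None.
    """
    groups = defaultdict(list)
    stack = []      # one entry per open marked-content sequence
    operands = []   # operands collected since the last operator
    for match in _TOKEN.finditer(content):
        tok = match.group()
        if tok[:1] in b"/(<[]" or _NUMBER.match(tok):
            operands.append(tok.decode("latin-1"))
            continue
        op = tok.decode("latin-1")
        if op == "BDC":
            is_oc = len(operands) >= 2 and operands[-2] == "/OC"
            stack.append(operands[-1].lstrip("/") if is_oc else None)
        elif op == "BMC":
            stack.append(None)          # marked content without properties
        elif op == "EMC":
            if stack:
                stack.pop()
        else:
            # The innermost enclosing /OC section decides the layer.
            layer = next((name for name in reversed(stack) if name), None)
            groups[layer].append((operands, op))
        operands = []
    return dict(groups)


sample = b"/OC /MC0 BDC 0 0 m 10 10 l S EMC /OC /MC1 BDC 1 0 0 rg 5 5 20 20 re f EMC"
for layer, ops in group_by_ocg(sample).items():
    print(layer, [op for _, op in ops])
```

For real files, an existing content-stream parser is still the safer route; the sketch only shows where the layer membership information lives.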

If your PDF is organized in OCG layers, then you can use the gdal_translate command of GDAL.
Use the following command to list all available OCG layers in your PDF file:
gdalinfo "sample.pdf" -mdd LAYERS
Then use the following command to extract a particular layer:
gdal_translate "sample.pdf" -of PNG sample.png --config GDAL_PDF_LAYERS "your_specific_layer_name"
More details are mentioned here.
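If you need to export several layers, the same command can be driven from Python. The following is a rough sketch that assumes the GDAL command-line tools are installed and on PATH; the helper names, the layer names "Roads" and "Buildings", and the output file naming are made up for illustration. By default it only builds and prints the commands, and runs them only when run=True.

```python
import shlex
import subprocess


def layer_export_command(pdf_path, layer_name, png_path):
    """Build the gdal_translate call that renders a single layer to PNG."""
    return [
        "gdal_translate", pdf_path, "-of", "PNG", png_path,
        "--config", "GDAL_PDF_LAYERS", layer_name,
    ]


def export_layers(pdf_path, layer_names, run=False):
    """Return one command per layer; execute them only when run=True."""
    commands = []
    for index, name in enumerate(layer_names):
        cmd = layer_export_command(pdf_path, name, f"layer_{index}.png")
        commands.append(cmd)
        if run:
            subprocess.run(cmd, check=True)  # needs GDAL installed
    return commands


# Hypothetical layer names; take the real ones from the gdalinfo output.
for cmd in export_layers("sample.pdf", ["Roads", "Buildings"]):
    print(shlex.join(cmd))
```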

Hey #pythonic_programmer, I was able to use the Python library pdflayers to change the default visibility (visible / not visible) of a layer and write the result to a new PDF file.
https://pypi.org/project/pdflayers/
In short, it changes the default state of the layer in the PDF file: https://helpx.adobe.com/acrobat/using/pdf-layers.html
Any layer that is not visible will, by default, not be rendered when you process the PDF document.
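For context, the default state lives in the default configuration dictionary (/D) under /OCProperties in the document catalog: /BaseState gives the starting state for all groups, and the /ON and /OFF arrays list the exceptions. The sketch below is not the pdflayers API; it uses plain Python dicts and made-up object ids as stand-ins for PDF objects, and treats the rarely used "Unchanged" base state as visible for simplicity.

```python
def default_visibility(ocgs, config):
    """Resolve which layers are visible by default.

    ocgs   -- maps an object id (a stand-in for an indirect reference)
              to the layer's display name
    config -- stand-in for the /D dictionary of /OCProperties
    """
    base_visible = config.get("/BaseState", "/ON") != "/OFF"
    state = {ref: base_visible for ref in ocgs}
    for ref in config.get("/ON", []):
        state[ref] = True
    for ref in config.get("/OFF", []):
        state[ref] = False
    return {ocgs[ref]: visible for ref, visible in state.items()}


# Hypothetical document with three layers, one switched off by default.
layers = {12: "Walls", 13: "Wiring", 14: "Dimensions"}
print(default_visibility(layers, {"/OFF": [13]}))
```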

Related

Can I tag a PDF programmatically?

Can an unstructured PDF be tagged using any tools or libraries?
The only ways I have found to tag a PDF are Adobe Acrobat and the Auto-Tag APIs (not something I am looking forward to, and the results are not great in my opinion).
I know the bounding boxes and semantics of the elements (i.e. paragraphs, lists, headings, tables).
So, is there a way to manipulate PDF trees/objects, preferably in Python or JavaScript?
Any thoughts on the topic are appreciated!
The PDF spec talks about "StructTreeRoot" for tagged PDFs. Building these objects by hand would be nerve-racking, so is there any high-level library to manipulate them?

Tagging a PDF, with all that entails, needs to be done by the PDF writer. Here is this page as tagged by Chromium/Foxit/Skia in MS Edge.
Consider how close to impossible this is to do retrospectively, a word, sentence, or paragraph at a time, as PDF does not inherently have such constructs.
Things like H1 are discarded by the paper printout generator as superfluous bloat that a printer does not need.
The prime reason for tagging is the reader who needs accessibility support, so let's see how a tagged PDF fares. Here we are only dealing with one simple page without images or tables (the two most common reasons for checking tags).
So how will an iterative application driven by Python programmatically resolve the remaining requirements that are missing?
Language: as a human I know the language is English (that should have been obvious to a browser that reads aloud).
Title: it is missing, but again it should be obvious. Is "TAGGING PDFS" suitable as a working title, pending approval by another person?
Let's temporarily ignore the major errors, namely that the tagging and the tab order are wrong. A human with eyes and a brain can analyse why and fix those while working through all the pages. Asking "can a human read and navigate this logically?" will itself resolve the tag order and, at the same time, check whether the page has suitable contrast for visually impaired readers.
So tagging a PDF is best done when a human makes their retrospective pass over the page, and that is best done using "pre-flight" / "post-flight" GUI applications such as Acrobat.

How to convert PDF to XML/JSON using Python code

Can anyone help me convert a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
Images
Mathematical equations
Chemical equations
Table data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into XML/JSON. Are there any libraries other than PDFMiner? PyPDF2, tabula-py, PDFQuery, Camelot, PyMuPDF, pdf to docx, and pandas are not suitable for my requirement.
Please advise me on any other options. Thank you.

The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this from the Allen Institute (it is based on GROBID and has a handy function for converting TEI XML to JSON).
The other package which, obviously, does a good job is the Adobe PDF Extract API (see here). It is of course a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to use from Python, well documented, and a good way to experiment and get a feel for the difficulties of reliable data extraction from PDF.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.
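To give an idea of what the TEI-to-JSON step involves, here is a small sketch using only the Python standard library. It is not the Allen Institute function mentioned above; the sample TEI fragment is hand-written for illustration and much simpler than real GROBID output, and the chosen JSON layout (a title plus a list of sections) is just one possible shape.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written stand-in for a TEI document; real output is far richer.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A Sample Paper</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
    <div><head>Methods</head><p>Only paragraph.</p></div>
  </body></text>
</TEI>"""


def tei_to_dict(tei_xml):
    """Collect the title and the body sections of a TEI document."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    sections = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        sections.append({
            "heading": div.findtext("tei:head", default="", namespaces=TEI_NS),
            # itertext() also picks up text inside inline child elements.
            "paragraphs": ["".join(p.itertext()).strip()
                           for p in div.iterfind("tei:p", TEI_NS)],
        })
    return {"title": title, "sections": sections}


print(json.dumps(tei_to_dict(SAMPLE_TEI), indent=2))
```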

Finding and identifying streams in PDF using python

I've been trying for about a week to automate image extraction from a PDF. Unfortunately, the answers I found here were of no help. I've seen multiple variations on the same code using PyPDF2, all with ['/XObject'] in them, which results in a KeyError.
What I'm looking for seems to be hiding in streams, which I can't find in PyPDF2's dictionary (even after recursively exploring the whole structure, calling .getObject() on every indirect object I can find).
Using PyPDF2 I've written one page of the PDF to a file and opened it in Notepad++, where I found some streams with the /FlateDecode filter.
pdfrw was slightly more helpful, allowing me to use PdfReader(path).pages[page].Contents.stream to get one stream (no clue how to get the others).
Using zlib, I decompressed it, and got something starting with:
/Part <</MCID 0 >>BDC
(It also contains a lot of floating-point numbers, both positive and negative)
From what I could find, BDC has something to do with ghostscript.
At this point I gave up and decided to ask for help.
Is there a Python tool to, at least, extract all streams (and identify the FlateDecode filter)?
And is there a way for me to identify what's hidden in there? I expected the start tag of some image format, which this clearly isn't. How do I further parse this result to find any image that could be hidden in there?
I'm looking for something I can apply to any PDF that's displayed properly. Some tool to further parse, or at least help me make sense of the streams, or even a reference that will help me understand what's going on.
Edit: it seems, as noted by Patrick, that I was barking up the wrong tree. I went to streams since I couldn't find any XObjects when opening the PDF in Notepad++, or when running the various Python scripts used to parse PDFs. I managed to find what I suspect are the images, with no XObject tags, but with what seems like a stream tag, though the information is not compressed.

Unless you are looking to extract inline images, which aren't that common, the content stream is not the place to look for images. The more common case is streams of type XObject and subtype Image, which are usually found in a page's Resources -> XObject dictionary (see sections 7.3.3, 7.8.3, and 8.9.5 of the PDF Reference indicated by #mkl).
Alternatively, Image XObjects can also be found in Form XObjects (subtype Form, which indicates they have their own content streams), in the form's own Resources -> XObject dictionary, so the search for Image XObjects can be recursive.
An Image XObject can also have a soft mask (SMask), which is itself an Image XObject. Form XObjects are also used in tiling patterns, and so could conceivably contain Image XObjects (though that isn't common either), or in an annotation's normal appearance (but Image XObjects are less commonly used within such annotations, except maybe 3D or multimedia annotations).
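The recursive search described above can be sketched as follows. To keep the example self-contained, plain nested dicts stand in for already resolved PDF dictionaries; with a real library you would first dereference indirect objects before walking them. The function name find_images and the resource names (/Im0, /Fm0, /Im1) are made up for illustration, and tiling patterns and annotation appearances are left out.

```python
def find_images(resources, path="", _seen=None):
    """Return the resource paths of all Image XObjects under `resources`.

    Recurses into Form XObjects, which carry their own /Resources, and
    reports soft masks. `_seen` guards against reference cycles.
    """
    _seen = set() if _seen is None else _seen
    found = []
    xobjects = (resources or {}).get("/XObject", {})
    for name, xobj in xobjects.items():
        if id(xobj) in _seen:
            continue
        _seen.add(id(xobj))
        here = f"{path}{name}"
        subtype = xobj.get("/Subtype")
        if subtype == "/Image":
            found.append(here)
            if "/SMask" in xobj:
                found.append(f"{here}/SMask")
        elif subtype == "/Form":
            found.extend(find_images(xobj.get("/Resources"), here, _seen))
    return found


# Stand-in for one page's /Resources dictionary.
page_resources = {
    "/XObject": {
        "/Im0": {"/Subtype": "/Image", "/SMask": {"/Subtype": "/Image"}},
        "/Fm0": {
            "/Subtype": "/Form",
            "/Resources": {"/XObject": {"/Im1": {"/Subtype": "/Image"}}},
        },
    }
}
print(find_images(page_resources))
```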

Create outlines/TOC for existing PDF in Python

I'm using pyPdf to merge several PDF files into one. This works great, but I would also need to add a table of contents/outlines/bookmarks to the PDF file that is generated.
pyPdf seems to have only read support for outlines. ReportLab would allow me to create them, but the open-source version does not support loading PDF files, so it can't be used to add outlines to an existing file.
Is there any way I can add outlines to an existing PDF using Python, or any library that would allow that?

https://github.com/yutayamamoto/pdfoutline
I made a python library just for adding an outline to an existing PDF file.

It looks like pypdf can do the job. See the add_outline_item method in the documentation.
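When the outline is meant to mirror the merged files, the only bookkeeping needed is the page on which each source file starts in the combined document. A minimal sketch of that step, using only the standard library (the function name and the sample titles are made up for illustration):

```python
def outline_entries(parts):
    """Compute (title, zero-based start page) for each merged file.

    parts -- list of (title, page_count) in merge order
    """
    entries = []
    start = 0
    for title, page_count in parts:
        entries.append((title, start))
        start += page_count
    return entries


# Hypothetical merge of three files with 3, 10 and 2 pages.
for title, page in outline_entries([("Intro", 3), ("Chapter 1", 10), ("Appendix", 2)]):
    print(f"{title}: starts at page index {page}")
```

Each pair could then be handed to the writer, for example as writer.add_outline_item(title, page) on a pypdf PdfWriter; check the pypdf documentation for the exact parameters in your version.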

We had a similar problem in WeasyPrint: cairo produces the PDF files but does not support bookmarks/outlines or hyperlinks. In the end we bit the bullet, read the PDF spec, and did it ourselves.
WeasyPrint's pdf.py has a simple PDF parser and writer that can add or override PDF "objects" in an existing document. It uses the PDF "update" mechanism and only appends at the end of the file.
This module was made for internal use only, but I'm open to refactoring it to make it easier to use in other projects.
However, the parser takes a few shortcuts and cannot parse all valid PDF files. It may need to be adapted if PyPDF's output is not as nice as cairo's. From the module's docstring:
Rather than trying to parse any valid PDF, we make some assumptions
that hold for cairo in order to simplify the code:
All newlines are '\n', not '\r' or '\r\n'
Except for number 0 (which is always free) there is no "free" object.
Most white space separators are made of a single 0x20 space.
Indirect dictionary objects do not contain '>>' at the start of a line except to mark the end of the object, followed by 'endobj'. (In other words, '>>' markers for sub-dictionaries are indented.)
The Page Tree is flat: all kids of the root page node are page objects, not page tree nodes.

What's a good document standard to use programmatically?

I'm writing a program that takes a document as input; it needs to replace a few values, insert a table, and convert the result to PDF. It's written in Python + Qt (PyQt). Is there any well-known document standard which can be easily used programmatically? It must be cross-platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats that I can't edit. Python has bindings for them, but they're Windows-only.
OpenOffice's ODT/ODF is zipped XML, so I can edit that one, but there are no command-line utilities or any other way to programmatically convert the file to a PDF. OpenOffice provides bindings, but you need to run OpenOffice from the command line, start a server, etc. And my clients may not have OpenOffice installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However, it loses formatting features and looks awful. I'm surprised there isn't a well-known library that lets you edit a variety of document formats and convert them into other formats. Am I missing something?
Update: Thanks for the advice, I'll have a look at using LaTeX.
Thanks,
Jackson

Have you looked into using LaTeX documents?
They are perfect to use programmatically (compiling documents? You gotta love that...), and there are several Python frameworks you can use, such as plasTeX and PyTex.
Exporting a LaTeX document to PDF is almost immediate.
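As a rough illustration of the "replace a few values and insert a table" workflow, here is a sketch that fills a LaTeX template using only the Python standard library. It does not use plasTeX or PyTex; the template, the placeholder names and the helper functions are made up for this example. The escaping covers only the most common special characters (not backslashes), and the resulting .tex file would still need to be compiled, for instance with pdflatex, assuming a TeX distribution is installed.

```python
from string import Template

_LATEX_SPECIALS = {
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
}


def escape_latex(text):
    """Escape the most common LaTeX special characters."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def render_table(rows):
    """Turn rows of cells into the body of a tabular environment."""
    return "\n".join(
        " & ".join(escape_latex(str(cell)) for cell in row) + r" \\"
        for row in rows
    )


# Hypothetical template; $name and $table are the placeholders.
TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
Dear $name,

\begin{tabular}{lr}
$table
\end{tabular}
\end{document}
""")


def render_document(name, rows):
    return TEMPLATE.substitute(name=escape_latex(name), table=render_table(rows))


print(render_document("R&D", [("Item", "Price"), ("Widget", "5%")]))
```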

Since you're already using PyQt anyway, it might be worth looking at Qt's built-in RTF processing module which looks decent. Here's the documentation on detailed content manipulation including inserting tables. Also the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
edit: included more links.

You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.

I don't know what kind of audience your program has. TeX is good and I would go with it.
Another possible choice is the Excel format, parsed with xlrd.
I've used it a couple of times and it's pretty straightforward.
An Excel file is a good choice for the following reasons:
Well-known format, easy to edit
You could prepare a predefined template with constraints and tables

Create XML documents, transform them to XSL-FO, and render them with FOP or RenderX. If you use DocBook as the primary input, there are toolchains freely available for converting it to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but it does deliver and can be embedded in an application, AFAICT.
Creating DocBook is very straightforward, as it has a wide range of semantic tags, table support, etc. to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free-flowing documents with lots of text.
For fill-in-the-blanks documents, a regular reporting engine may be a better fit, or some straightforward XSL stylesheets spitting out the XSL-FO directly.
