Is there an equivalent to Latex's \endinput in Python-Sphinx - python

LaTeX and Python-Sphinx are document processors, transforming formatted content into a target document. The usual workflow when working with large documents is to split them into smaller files (sections, chapters) and use a specific file as a table of contents (main.tex in most LaTeX projects, index.rst in Sphinx).
In LaTeX, it is possible to stop the processing of a specific file with the \endinput command (the processor then moves on to the next section or chapter).
I cannot find such a directive in Python-Sphinx's documentation. Does it exist? If not, is there a way to implement this behavior?

None that I know of. But you can do the inverse: include what you want, which has the same effect as excluding what you don't want, by using the ifconfig Sphinx extension. You might have to indent what you want.

Related

Extract Geometry Elements from PDF by OCG (by Layer)

So I've spent the good majority of a month on this issue. I'm looking for a way to extract geometry elements (polylines, text, arcs, etc.) from a vectorized PDF organised by the file's OCGs (Optional Content Groups), which are basically PDF layers. Using PDFminer I was able to extract geometry (LTCurves, LTTextBoxes, LTLines, etc.); using PyPDF2, I was able to see how many OCGs were in the PDF, though I was not able to access the geometry associated with each OCG. I've seen and tried a few hacky scripts online that might have solved this problem, but to no avail. I even resorted to opening the raw PDF data in a text editor and haphazardly removing parts of it to see if I could come up with some custom parsing technique, but again to no avail. Adobe's PDF manual is minimal at best, so that was no help when I was attempting to create a parser. Does anyone know a solution to this?
At this point, I'm open to a solution in any language, using any OS (though I would prefer a solution using Python 3 on Windows or Linux), as long as it is open source / free.
Can anyone here help end this rabbit hole of darkness? Much appreciated!
A PDF document consists of two "types" of data. There is an object oriented "structure" to the document to divide it into pages, and carry meta data (like, for instance, there is this list of Optional Content Groups), and there is a stream oriented list of marking operators that actually "draw" content onto the page.
The fact that there are OCGs, their names, and a few of their properties are stored in the object-oriented content and can be extracted fairly easily by parsing the objects. But the membership of the OCGs is NOT stored in the object structure. It can only be found by parsing the content stream. A group of marking operators belongs to a particular OCG when it is preceded by the content operator /OC /optionalcontentgroupname BDC and followed by the operator EMC.
Parsing a content stream is a less than trivial task. There are many tools out there that will do this for you. I would not, myself, attempt to build such a parser from scratch. There is little value in reinventing the wheel.
The complete syntax of PDF is available from many sources. Search the web for "PDF Specification 1.7" or "ISO 32000-1:2008". It is a daunting document to read, but it does supply all of the information needed to create both an object parser and a content parser.
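To make the BDC / EMC bracketing concrete, here is a rough Python sketch that groups operations by the enclosing /OC section. It assumes the content stream has already been decompressed to text and can be split on whitespace, which real streams (with strings, arrays, inline images) do not guarantee. The names MC0 and MC1 and the sample stream are made up; in a real file the name after /OC is a key in the page's Properties resource dictionary, which in turn points at the OCG.

```python
def group_by_ocg(content):
    """Group drawing operations by the /OC marked-content section around them."""
    groups = {}
    stack = []    # one entry per open BMC/BDC section; None when it is not /OC
    pending = []  # operands collected since the previous operator
    for token in content.split():
        # Crude test: operators are alphabetic, optionally ending in "*".
        is_operator = token.rstrip("*").isalpha()
        if not is_operator:
            pending.append(token)
            continue
        if token == "BDC":
            is_oc = len(pending) >= 2 and pending[-2] == "/OC"
            stack.append(pending[-1].lstrip("/") if is_oc else None)
        elif token == "BMC":
            stack.append(None)
        elif token == "EMC":
            if stack:
                stack.pop()
        else:
            # Attribute the operation to the innermost enclosing /OC section.
            layer = next((name for name in reversed(stack) if name), None)
            if layer is not None:
                groups.setdefault(layer, []).append(" ".join(pending + [token]))
        pending = []
    return groups


# Hypothetical, already-decoded content stream.
SAMPLE = """
q
/OC /MC0 BDC
0 0 m
100 100 l
S
EMC
/OC /MC1 BDC
10 10 50 50 re
f
EMC
Q
"""

if __name__ == "__main__":
    for layer, operations in group_by_ocg(SAMPLE).items():
        print(layer, operations)
```

This only illustrates the nesting logic; for real documents an existing content-stream parser is the safer route, as the answer says.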
If your PDF is organized in OCG layers, then you can use the gdal_translate command of GDAL.
Use the following command to list all available layers in your PDF file:
gdalinfo "sample.pdf" -mdd LAYERS
Then, use the following command to extract a particular layer:
gdal_translate "sample.pdf" -of PNG sample.png --config GDAL_PDF_LAYERS "your_specific_layer_name"
More details are mentioned here.
Hey #pythonic_programmer, I was able to use the Python library pdflayers to change the default visibility (visible / not visible) of a layer when writing a new PDF file.
https://pypi.org/project/pdflayers/
In effect, it disables the default state of the layer
in the PDF file: https://helpx.adobe.com/acrobat/using/pdf-layers.html
A layer that is not visible will not be rendered by default when the PDF document is processed.

using :ref: in Python docstrings using Sphinx

I'm using Sphinx to document a python project, and I'm trying to create a reusable tip to be used in several locations.
Typically, I'll use the following syntax in a python file:
"""
.. tip::
    I want this tip to be used in several locations. Why?
    - Save time
    - Work less
"""
Now this works whether I put it at the beginning of the file, right under class definition or right under function definition.
I found Sphinx's manual for :ref:, which suggests to use a label:
.. _my_reusable_tip:
.. tip::
...
And then call this tip with :ref:`my_reusable_tip` anywhere I want.
The manual states that 'it works across files, when section headings are changed, and for all builders that support cross-references'
The thing is, it doesn't matter in which .py file in the project I write the label and tip definition, the :ref:`my_reusable_tip` just displays 'my_reusable_tip', and not the tip itself.
What I'm using to build the documentation is
sphinx-apidoc -f -F -o
make html
I'm pretty sure my logic is flawed in some way, but I can't figure out why.
I know that Sphinx searches the project for reStructuredText and renders it if it can, but I think I'm missing something here.
I tried adding this label in a separate .py file enclosed in """, and in a separate .txt file without the enclosing """.
I tried creating an .rst file with the label definition and rebuild the html documentation.
What am I missing here?
Python 3.4.3 BTW.
In Sphinx, a :ref: is simply a more robust way of linking to (or referencing) another part of the document. Thus, your use of :ref: will simply provide a hyperlink to the label.
It is not a way of substituting or expanding a block.
Inline substitutions are available using |...|; however, an inline substitution cannot be used to substitute a block, as you seem to require.
reStructuredText is not a template language, and thus doesn't provide macro-like facilities. If you need them, an alternative solution is to use a template library such as mako or jinja to deal with this kind of issue.
Why not just use the reStructuredText include directive
.. include:: ./my_reusable_tip.txt
in your rst files?
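As a sketch of how the include suggestion might look from the Python side: the tip lives once in a text file, and each docstring pulls it in. The function names are hypothetical, and how the relative path is resolved for a docstring depends on the build setup, so treat the path as an assumption to verify.

```python
# Content that would be saved once as my_reusable_tip.txt (shown here as a string).
REUSABLE_TIP = """\
.. tip::

   I want this tip to be used in several locations. Why?

   - Save time
   - Work less
"""


def save_report(path):
    """Write the report to ``path`` (hypothetical function).

    .. include:: ./my_reusable_tip.txt
    """


def load_report(path):
    """Read the report from ``path`` (hypothetical function).

    .. include:: ./my_reusable_tip.txt
    """
```

Each docstring then carries only the one-line directive, and editing the text file updates the tip everywhere it is included.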

restructuredtext: use directives for metadata

I'm writing a simple webpage generator based on restructuredtext and I'd like to put tags into the document, like this.
=====
Title
=====
:author: Me
:tags: foo, bar
Here we go ...
What I want now:
get in possession of some kind of document tree
find the tags entry, read it, process it (like print the tags on the command line), remove it and render the remaining tree.
So I'd like to write compatible restructuredtext in case it's being compiled with something different than my program.
Can someone give me a hint? I found this one here http://svn.python.org/projects/external/docutils-0.6/docutils/examples.py showing in the internals method how to obtain the document (and therefore the dom tree), but is this the best way to go or would a regex based approach (find lines, remove them) be a lot easier? Working with the tree would also involve the conversion tree → document and so on.
There are tools that can do this for you. See http://docutils.sourceforge.net/docs/user/links.html
I think I have a nice solution for both problems. First, the core.py file in the docutils distribution shows how to obtain the doctree and how to write it (using an HTML writer, for instance); see publish_from_doctree and publish_doctree. Then, there is docutils.nodes.SparseNodeVisitor, which one can subclass, overriding methods like visit_field to manipulate the document tree in various ways.
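For comparison, the regex-based route raised in the question can be sketched with the standard library alone. This is not the docutils approach described above: it assumes the :tags: field sits on a single line, and the helper name pop_tags is made up for illustration.

```python
import re

# Matches one whole ":tags: ..." line, including its trailing newline.
TAGS_FIELD = re.compile(r"^:tags:[ \t]*(.*)\n?", re.MULTILINE)


def pop_tags(source):
    """Return (tags, source_without_tags_line) for a reStructuredText string."""
    match = TAGS_FIELD.search(source)
    if match is None:
        return [], source
    tags = [tag.strip() for tag in match.group(1).split(",") if tag.strip()]
    remaining = source[:match.start()] + source[match.end():]
    return tags, remaining


if __name__ == "__main__":
    text = "=====\nTitle\n=====\n:author: Me\n:tags: foo, bar\nHere we go ...\n"
    tags, rest = pop_tags(text)
    print(tags)
    print(rest)
```

The remaining text is still valid reStructuredText and can be handed to any renderer, but multi-line field bodies or fields inside literal blocks would need the tree-based approach.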

What's a good document standard to use programmatically?

I'm writing a program that requires input in the form of a document, it needs to replace a few values, insert a table, and convert it to PDF. It's written in Python + Qt (PyQt). Is there any well known document standard which can be easily used programmatically? It must be cross platform, and preferably open.
I have looked into Microsoft Doc and Docx, which are binary formats and I can't edit them. Python has bindings for it, but they're only on Windows.
Open Office's ODT/ODF is XML in a zip file, so I can edit that one, but there are no command-line utilities or any other way to programmatically convert the file to a PDF. Open Office provides bindings, but you need to run Open Office from the command line, start a server, etc. And my clients may not have Open Office installed.
RTF is readable from Python, but I couldn't find any way/libraries to convert RTF documents to PDF.
At the moment I'm exporting from Microsoft Word to HTML, replacing the values and using PyQt to convert it to a PDF. However, it loses formatting features and looks awful. I'm surprised there isn't a well-known library which lets you edit a variety of document formats and convert them into other formats. Am I missing something?
Update: Thanks for the advice, I'll have a look at using Latex.
Thanks,
Jackson
Have you looked into using LaTeX documents?
They are perfect to use programmatically (compiling documents? You gotta love that...), and you have several Python frameworks you can use such as plasTeX and PyTex.
Exporting a LaTeX document to PDF is almost immediate.
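A minimal sketch of that workflow, under some assumptions: placeholders use "@" so they do not clash with "$" in LaTeX, the template text and field names are invented for illustration, values are not LaTeX-escaped, and pdflatex is assumed to be installed and on the PATH.

```python
import subprocess
from string import Template


class TexTemplate(Template):
    # "@name" placeholders, so LaTeX's own "$" is left alone.
    delimiter = "@"


# Hypothetical template; a real one would live in its own .tex file.
TEMPLATE = TexTemplate(r"""\documentclass{article}
\begin{document}
Dear @customer, your total is @total.

\begin{tabular}{ll}
@rows
\end{tabular}
\end{document}
""")


def render(customer, total, rows):
    """Fill in the values and build the table body from a list of rows."""
    body = "\n".join(" & ".join(row) + r" \\" for row in rows)
    return TEMPLATE.substitute(customer=customer, total=total, rows=body)


def compile_pdf(tex_path):
    """Run pdflatex on the file; raises CalledProcessError if it fails."""
    subprocess.run(["pdflatex", "-interaction=nonstopmode", tex_path], check=True)


if __name__ == "__main__":
    with open("letter.tex", "w", encoding="utf-8") as handle:
        handle.write(render("Ada", "42", [("Item", "Price"), ("Pen", "2")]))
    compile_pdf("letter.tex")
```

User-supplied values containing characters such as & or % would need escaping before substitution.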
Since you're already using PyQt anyway, it might be worth looking at Qt's built-in RTF processing module which looks decent. Here's the documentation on detailed content manipulation including inserting tables. Also the QPrinter module's default print-to-file format happens to be PDF.
Without knowing more about your particular needs it's hard to say if these would do what you want, but since your application already has PyQt as a dependency, seems silly to introduce any more without evaluating the functionality you've already got available.
The non-GUI parts of the Qt framework are often overlooked though.
You might want to try ReportLab. The open source version can write PDFs, and the commercial version has a lot of really nice abstractions to allow output to a variety of different formats from a single input.
I don't know what kind of audience your program has; TeX is good and I would go with it.
Another possible choice is the Excel format, parsing it with xlrd.
I've used it a couple of times and it's pretty straightforward.
An Excel file is a good choice for the following reasons:
Well-known format that is easy to edit
You could prepare a predefined template with constraints and a table
Another option is creating XML documents, transforming them to XSL-FO and rendering with FOP or RenderX. If you use DocBook as the primary input, there are toolchains freely available for converting that to PDF, RTF, HTML and so forth.
It is rather quirky to use and not my idea of fun, but it does deliver and can be embedded in an application, AFAICT.
Creating docbook is very straightforward as it has a wide range of semantic tags, table support etc to give a "meaningful" markup which can be reliably formatted. The XSL stylesheets are modular and allow parts to be customized or replaced to generate your own look and feel.
It works well for relatively free flow documents with lots of text.
For fill-in-the-blanks kinds of documents, a regular reporting engine may be a better fit, or some straightforward XSL stylesheets spitting out the XSL-FO directly.

How to programmatically insert comments into a Microsoft Word document?

Looking for a way to programmatically insert comments (using the comments feature in Word) into a specific location in a MS Word document. I would prefer an approach that is usable across recent versions of MS Word standard formats and implementable in a non-Windows environment (ideally using Python and/or Common Lisp). I have been looking at the OpenXML SDK but can't seem to find a solution there.
Here is what I did:
Create a simple document with Word (i.e. a very small one)
Add a comment in Word
Save as docx.
Use the zip module of python to access the archive (docx files are ZIP archives).
Dump the content of the entry "word/document.xml" in the archive. This is the XML of the document itself.
This should give you an idea what you need to do. After that, you can use one of the XML libraries in Python to parse the document, change it and add it back to a new ZIP archive with the extension ".docx". Simply copy every other entry from the original ZIP and you have a new, valid Word document.
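The steps above can be sketched with the standard zipfile module. The function names are made up for illustration, and the sketch only swaps the main document part; a real comment may involve other entries in the archive, which comparing the two saved files would reveal.

```python
import zipfile

DOCUMENT_PART = "word/document.xml"


def read_document_xml(docx_path):
    """Return the XML of the main document part as a string."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read(DOCUMENT_PART).decode("utf-8")


def rewrite_docx(src_path, dst_path, new_document_xml):
    """Copy every entry to a new archive, replacing only the document part."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == DOCUMENT_PART:
                data = new_document_xml.encode("utf-8")
            dst.writestr(item, data)
```

A typical use would be to call read_document_xml, modify the result with an XML library, and pass the serialized string to rewrite_docx with a new ".docx" path.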
There is also a library which might help: openxmllib
If this is server-side (non-interactive), use of the Word application itself is unsupported (but I see this is not applicable). So either take that route or use the OpenXML SDK to learn the markup needed to create a comment. With that knowledge, it is all about manipulating data.
The .docx format is a ZIP of XML files with a defined structure, so once you get into the ZIP and find the right XML file, it mostly becomes a matter of modifying an XML DOM.
The best route might be to take a docx, copy it, add a comment (using Word) to one, and compare. A diff will show you the kind of elements/structures you need to be looking up in the SDK (or ISO/Ecma standard).
