I'm building a custom Sphinx extension and directive to render interactive charts in Sphinx and on ReadTheDocs. The actual data for a chart resides in a .json file.
In a .rst document, including a chart looks like this:
.. chart:: charts/test.json
    :width: 400px
    :height: 250px

    This is the caption of the chart, yay...
The steps I take are:
Render the documentation with a placeholder the same size as the chart (a loading spinner)
Use jQuery (in an added javascript file) to get the URI (which I add as an attribute to the dom node in my Directive) of the json file (in this case charts/test.json)
Use jQuery to fetch the file and parse the JSON
On successful fetch of the data, use the plotly library to render it into a chart and use jQuery to remove the placeholder
The directive looks like:
class PlotlyChartDirective(Directive):
    """ Top-level plotly chart directive """

    has_content = True

    def px_value(argument):
        # This is not callable as self.align. We cannot make it a
        # staticmethod because we're saving an unbound method in
        # option_spec below.
        return directives.length_or_percentage_or_unitless(argument, 'px')

    required_arguments = 1
    optional_arguments = 0
    option_spec = {
        # TODO allow static images for PDF renders 'altimage': directives.unchanged,
        'height': px_value,
        'width': px_value,
    }

    def run(self):
        """ Parse a plotly chart directive """
        self.assert_has_content()
        env = self.state.document.settings.env

        # Ensure the current chart ID is initialised in the environment
        if 'next_plotly_chart_id' not in env.temp_data:
            env.temp_data['next_plotly_chart_id'] = 0
        id = env.temp_data['next_plotly_chart_id']

        # Handle the URI of the *.json asset
        uri = directives.uri(self.arguments[0])

        # Create the main node container and store the URI of the file which will be collected later
        node = nodes.container()
        node['classes'] = ['sphinx-plotly']

        # Increment the ID counter ready for the next chart
        env.temp_data['next_plotly_chart_id'] += 1

        # Only if it's a supported builder do we proceed (otherwise return an empty node)
        if env.app.builder.name in get_compatible_builders(env.app):
            chart_node = nodes.container()
            chart_node['classes'] = ['sphinx-plotly-chart', f"sphinx-plotly-chart-id-{id}", f"sphinx-plotly-chart-uri-{uri}"]
            placeholder_node = nodes.container()
            placeholder_node['classes'] = ['sphinx-plotly-placeholder', f"sphinx-plotly-placeholder-{id}"]
            placeholder_node += nodes.caption('', 'Loading...')
            node += chart_node
            node += placeholder_node

            # Add optional chart caption and legend (inspired by Figure directive)
            if self.content:
                caption_node = nodes.Element()  # Anonymous container for parsing
                self.state.nested_parse(self.content, self.content_offset, caption_node)
                first_node = caption_node[0]
                if isinstance(first_node, nodes.paragraph):
                    caption = nodes.caption(first_node.rawsource, '', *first_node.children)
                    caption.source = first_node.source
                    caption.line = first_node.line
                    node += caption
                elif not (isinstance(first_node, nodes.comment) and len(first_node) == 0):
                    error = self.state_machine.reporter.error(
                        'Chart caption must be a paragraph or empty comment.',
                        nodes.literal_block(self.block_text, self.block_text),
                        line=self.lineno)
                    return [node, error]
                if len(caption_node) > 1:
                    node += nodes.legend('', *caption_node[1:])

        return [node]
No matter where I look in the source code of the Figure and Image directives (on which I'm basing this), I can't figure out how to copy the actual image from its input location into the static folder in the build directory.
Without copying the *.json file specified in the argument to my directive, I always get a file not found!
I've tried hard to find supporting methods (I assumed that the Sphinx app instance would have an add_static_file() method just like it has an add_css_file() method).
I've also tried adding a list of files to copy to the app instance, then copying the assets at the end of the build (but I am being thwarted because you can't add attributes to the Sphinx class).
QUESTION IN A NUTSHELL
In a custom directive, how do you copy an asset (whose path is specified by an argument to the directive) to the build's _static directory?
I realised it was straightforward to use Sphinx's copyfile utility (which only copies if a file is changed, so is quick) right within the directive's run() method.
See how I build src_uri and build_uri and copy the file directly within this updated directive:
# Imports this snippet relies on (copyfile is assumed to come from sphinx.util.osutil)
import os
from docutils import nodes
from docutils.parsers.rst import Directive, directives
from sphinx.util.osutil import copyfile


class PlotlyChartDirective(Directive):
    """ Top-level plotly chart directive """

    has_content = True

    def px_value(argument):
        # This is not callable as self.align. We cannot make it a
        # staticmethod because we're saving an unbound method in
        # option_spec below.
        return directives.length_or_percentage_or_unitless(argument, 'px')

    required_arguments = 1
    optional_arguments = 0
    option_spec = {
        # TODO allow static images for PDF renders 'altimage': directives.unchanged,
        'height': px_value,
        'width': px_value,
    }

    def run(self):
        """ Parse a plotly chart directive """
        self.assert_has_content()
        env = self.state.document.settings.env

        # Ensure the current chart ID is initialised in the environment
        if 'next_plotly_chart_id' not in env.temp_data:
            env.temp_data['next_plotly_chart_id'] = 0

        # Get the ID of this chart
        id = env.temp_data['next_plotly_chart_id']

        # Handle the src and destination URI of the *.json asset
        uri = directives.uri(self.arguments[0])
        src_uri = os.path.join(env.app.builder.srcdir, uri)
        build_uri = os.path.join(env.app.builder.outdir, '_static', uri)

        # Create the main node container and store the URI of the file which will be collected later
        node = nodes.container()
        node['classes'] = ['sphinx-plotly']

        # Increment the ID counter ready for the next chart
        env.temp_data['next_plotly_chart_id'] += 1

        # Only if it's a supported builder do we proceed (otherwise return an empty node)
        if env.app.builder.name in get_compatible_builders(env.app):
            # Make the directories and copy file (if file has changed)
            destdir = os.path.dirname(build_uri)
            if not os.path.exists(destdir):
                os.makedirs(destdir)
            copyfile(src_uri, build_uri)

            width = self.options.pop('width', DEFAULT_WIDTH)
            height = self.options.pop('height', DEFAULT_HEIGHT)
            chart_node = nodes.container()
            chart_node['classes'] = ['sphinx-plotly-chart', f"sphinx-plotly-chart-id-{id}", f"sphinx-plotly-chart-uri-{uri}"]
            placeholder_node = nodes.container()
            placeholder_node['classes'] = ['sphinx-plotly-placeholder', f"sphinx-plotly-placeholder-{id}"]
            placeholder_node += nodes.caption('', 'Loading...')
            node += chart_node
            node += placeholder_node

            # Add optional chart caption and legend (as per figure directive)
            if self.content:
                caption_node = nodes.Element()  # Anonymous container for parsing
                self.state.nested_parse(self.content, self.content_offset, caption_node)
                first_node = caption_node[0]
                if isinstance(first_node, nodes.paragraph):
                    caption = nodes.caption(first_node.rawsource, '', *first_node.children)
                    caption.source = first_node.source
                    caption.line = first_node.line
                    node += caption
                elif not (isinstance(first_node, nodes.comment) and len(first_node) == 0):
                    error = self.state_machine.reporter.error(
                        'Chart caption must be a paragraph or empty comment.',
                        nodes.literal_block(self.block_text, self.block_text),
                        line=self.lineno)
                    return [node, error]
                if len(caption_node) > 1:
                    node += nodes.legend('', *caption_node[1:])

        return [node]
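The copy step in run() can also be pulled out into a small helper, which makes it easier to try outside a Sphinx build. This is only a minimal sketch: copy_chart_asset is a hypothetical name, and shutil.copyfile is used as a standard-library stand-in for Sphinx's copyfile, so this version always copies rather than skipping unchanged files.

```python
import os
import shutil
import tempfile


def copy_chart_asset(srcdir, outdir, uri):
    """Copy srcdir/uri to outdir/_static/uri and return the destination path.

    Hypothetical helper: shutil.copyfile stands in for Sphinx's copyfile.
    """
    src_path = os.path.join(srcdir, uri)
    dest_path = os.path.join(outdir, "_static", uri)
    # Create intermediate directories such as _static/charts/ on first use.
    os.makedirs(os.path.dirname(dest_path), exist_ok=True)
    shutil.copyfile(src_path, dest_path)
    return dest_path


# Throwaway directories stand in for the Sphinx source and build trees.
with tempfile.TemporaryDirectory() as srcdir, tempfile.TemporaryDirectory() as outdir:
    os.makedirs(os.path.join(srcdir, "charts"))
    with open(os.path.join(srcdir, "charts", "test.json"), "w") as fh:
        fh.write('{"data": []}')
    dest = copy_chart_asset(srcdir, outdir, "charts/test.json")
    copied_ok = os.path.isfile(dest)
    relative_dest = os.path.relpath(dest, outdir)
    print(relative_dest)
```

Inside the directive, srcdir and outdir would be the builder's srcdir and outdir, as in the run() method above.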
I've built a ctypes interface to Libxml2, the Python xmlDoc is:
class xmlDoc(ctypes.Structure):
    _fields_ = [
        ("_private", ctypes.c_void_p),    # application data
        ("type", ctypes.c_uint16),        # XML_DOCUMENT_NODE, must be second !
        ("name", ctypes.c_char_p),        # name/filename/URI of the document
        ("children", ctypes.c_void_p),    # the document tree
        ("last", ctypes.c_void_p),        # last child link
        ("parent", ctypes.c_void_p),      # child->parent link
        ("next", ctypes.c_void_p),        # next sibling link
        ("prev", ctypes.c_void_p),        # previous sibling link
        ("doc", ctypes.c_void_p),         # autoreference to itself End of common part
        ("compression", ctypes.c_int),    # level of zlib compression
        ("standalone", ctypes.c_int),     # standalone document (no external refs) 1 if standalone="yes" 0 if sta
        ("intSubset", ctypes.c_void_p),   # the document internal subset
        ("extSubset", ctypes.c_void_p),   # the document external subset
        ("oldNs", ctypes.c_void_p),       # Global namespace, the old way
        ("version", ctypes.c_char_p),     # the XML version string
        ("encoding", ctypes.c_char_p),    # external initial encoding, if any
        ("ids", ctypes.c_void_p),         # Hash table for ID attributes if any
        ("refs", ctypes.c_void_p),        # Hash table for IDREFs attributes if any
        ("URL", ctypes.c_char_p),         # The URI for that document
        ("charset", ctypes.c_int),        # Internal flag for charset handling, actually an xmlCharEncoding
        ("dict", ctypes.c_void_p),        # dict used to allocate names or NULL
        ("psvi", ctypes.c_void_p),        # for type/PSVI information
        ("parseFlags", ctypes.c_int),     # set of xmlParserOption used to parse the document
        ("properties", ctypes.c_int),     # set of xmlDocProperties for this document set at the end of parsing
    ]
The char* pointers all make sense, but the xmlNode* and xmlDoc* pointers don't; xmlDoc->doc should point back to the same location (from VS Code):
The solution was in my own code, which came from the ctypes templates. Effectively, the template cast ctypes.c_void_p to a ctypes.POINTER(), which in this case is my structure definition of the xmlNode. The line of code is:
# perfect use of lambda
xmlNode = lambda x: ctypes.cast(x, ctypes.POINTER(LibXml.xmlNode))
And the fixed code:
def InsertChild(tree: QTreeWidget, item: QTreeWidgetItem, node: ctypes.c_void_p):
    cur = node.contents
    xmlNode = lambda x: ctypes.cast(x, ctypes.POINTER(LibXml.xmlNode))
    while cur:
        item.setText(0, cur.name.decode('utf-8'))
        # if cur.content: item.setText(1, cur.content.decode('utf-8'))
        item.setText(2, utils.PtrToHex(ctypes.addressof(cur)))
        if cur.children:
            child = QTreeWidgetItem(tree)
            item.addChild(child)
            InsertChild(tree, child, xmlNode(cur.children))
        if cur.next:
            cur = xmlNode(cur.next)
            item = QTreeWidgetItem(tree)
        else: cur = None
    return
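The cast described above can be tried without libxml2. The following is a minimal sketch: Node is a made-up two-field structure standing in for xmlNode, and as_node and walk are hypothetical helper names. It shows a link stored as c_void_p being turned back into a typed pointer and dereferenced with .contents.

```python
import ctypes


class Node(ctypes.Structure):
    # Hypothetical stand-in for xmlNode: the link is stored as a raw address.
    _fields_ = [("value", ctypes.c_int), ("next", ctypes.c_void_p)]


def as_node(address):
    """Reinterpret a raw address (as read from a c_void_p field) as a Node pointer."""
    return ctypes.cast(address, ctypes.POINTER(Node))


def walk(head):
    """Follow the 'next' links and collect each node's value."""
    values = []
    cur = head
    while cur is not None:
        values.append(cur.value)
        # A NULL c_void_p field reads back as None, so test before casting.
        cur = as_node(cur.next).contents if cur.next else None
    return values


# Build a three-node list; the module-level names keep the nodes alive.
third = Node(3, None)
second = Node(2, ctypes.addressof(third))
first = Node(1, ctypes.addressof(second))
print(walk(first))
```

Declaring the field as c_void_p keeps the structure definition free of forward references; the cost is that every traversal step needs an explicit cast like the one in as_node.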
I am trying the example from the Google repo:
https://github.com/googleapis/python-documentai/blob/HEAD/samples/snippets/quickstart_sample.py
I have an error:
metadata=[('x-goog-request-params', 'name=projects/my_proj_id/locations/us/processors/my_processor_id'), ('x-goog-api-client', 'gl-python/3.8.10 grpc/1.38.1 gax/1.30.0 gapic/1.0.0')]), last exception: 503 DNS resolution failed for service: https://us-documentai.googleapis.com/v1/
My full code:
from google.cloud import documentai_v1 as documentai
import os

# TODO(developer): Uncomment these variables before running the sample.
project_id = '123456789'
location = 'us'  # Format is 'us' or 'eu'
processor_id = '1a23345gh823892'  # Create processor in Cloud Console
file_path = 'document.jpg'
os.environ['GRPC_DNS_RESOLVER'] = 'native'


def quickstart(project_id: str, location: str, processor_id: str, file_path: str):
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = {}
    if location == "eu":
        opts = {"api_endpoint": "eu-documentai.googleapis.com"}
    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processor/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}:process"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()
    document = {"content": image_content, "mime_type": "image/jpeg"}

    # Configure the process request
    request = {"name": name, "raw_document": document}
    result = client.process_document(request=request)
    document = result.document
    document_pages = document.pages

    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document
    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        paragraphs = page.paragraphs
        for paragraph in paragraphs:
            print(paragraph)
            paragraph_text = get_text(paragraph.layout, document)
            print(f"Paragraph text: {paragraph_text}")


def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        start_index = (
            int(segment.start_index)
            if segment in doc_element.text_anchor.text_segments
            else 0
        )
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response


def main():
    quickstart(project_id=project_id, location=location, processor_id=processor_id, file_path=file_path)


if __name__ == '__main__':
    main()
FYI, on the Google Cloud website it stated that the endpoint is:
https://us-documentai.googleapis.com/v1/projects/123456789/locations/us/processors/1a23345gh823892:process
I can use the web interface to run DocumentAI so it is working. I just have the problem with Python code.
Any suggestion is appreciated.
I would suspect the GRPC_DNS_RESOLVER environment variable to be the root cause. Did you try with the corresponding line commented out? Why was it added in your code?
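Separately from the environment variable, two other things may be worth ruling out. Both are assumptions, not a confirmed diagnosis. The error message names the service as a full URL, whereas the sample's api_endpoint option holds a bare host name; and the resource name in the question ends in ":process", which looks like the REST URL form. The sketch below uses two hypothetical helpers, bare_endpoint and processor_name, to normalise both values with the standard library only.

```python
from urllib.parse import urlparse


def bare_endpoint(endpoint):
    """Return only the host (and port, if any) of an endpoint string."""
    # urlparse only fills in netloc when a '//' separator is present.
    parsed = urlparse(endpoint if "://" in endpoint else "//" + endpoint)
    return parsed.netloc


def processor_name(project_id, location, processor_id):
    """Build the processor resource name without a ':process' suffix."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


print(bare_endpoint("https://us-documentai.googleapis.com/v1/"))
print(processor_name("123456789", "us", "1a23345gh823892"))
```

If either value in the real script differs from what these helpers produce, that difference is a candidate to test alongside commenting out the GRPC_DNS_RESOLVER line.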
I'm using PyPDF2 to alter a PDF document (adding bookmarks). So I need to read in the entire source PDF, and write it out, keeping as much of the data intact as possible. Merely writing each page into a new PDF object may not be sufficient to preserve document metadata.
PdfFileWriter() does have a number of methods for copying an entire file: cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot. However, they all have problems.
If I use cloneDocumentFromReader or appendPagesFromReader, I get a valid PDF file, with the correct number of pages, but all pages are blank.
If I use cloneReaderDocumentRoot, I get a minimal valid PDF file, but with no pages or data.
This has been asked before, but with no successful answers.
Other questions have asked about Blank pages in PyPDF2, but I can't apply the answer given.
Here's my code:
def bookmark(incomingFile):
    reader = PdfFileReader(incomingFile)
    writer = PdfFileWriter()
    writer.appendPagesFromReader(reader)
    #writer.cloneDocumentFromReader(reader)
    my_table_of_contents = [
        ('Page 1', 0),
        ('Page 2', 1),
        ('Page 3', 2)
    ]
    # writer.addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit')
    for title, pagenum in my_table_of_contents:
        writer.addBookmark(title, pagenum, parent=None)
    writer.setPageMode("/UseOutlines")
    with open(incomingFile, "wb") as fp:
        writer.write(fp)
I tend to get errors when PyPDF2 can't add a bookmark to the PdfFileWriter object, because it doesn't have any pages, or similar.
I also wrestled with this a lot, and finally found that PyPDF2 has this issue.
Basically I copied this answer's code into C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py (this will depend on your distribution) around line 382 for the cloneDocumentFromReader function.
After that I was able to append the reader pages to the writer with writer.cloneDocumentFromReader(pdf) and, in my case, to update PDF Metadata (Subject, Keywords, etc.).
Hope this helps you
'''
Create a copy (clone) of a document from a PDF file reader
:param reader: PDF file reader instance from which the clone
    should be created.
:callback after_page_append (function): Callback function that is invoked after
    each page is appended to the writer. Signature includes a reference to the
    appended page (delegates to appendPagesFromReader). Callback signature:
    :param writer_pageref (PDF page reference): Reference to the page just
        appended to the document.
'''
debug = False
if debug:
    print("Number of Objects: %d" % len(self._objects))
    for obj in self._objects:
        print("\tObject is %r" % obj)
        if hasattr(obj, "indirectRef") and obj.indirectRef != None:
            print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

# Variables used for after cloning the root to
# improve pre- and post- cloning experience
mustAddTogether = False
newInfoRef = self._info
oldPagesRef = self._pages
oldPages = self.getObject(self._pages)

# If there have already been any number of pages added
if oldPages[NameObject("/Count")] > 0:
    # Keep them
    mustAddTogether = True
else:
    # Throw the page object out
    if oldPages in self._objects:
        newInfoRef = self._pages
        self._objects.remove(oldPages)

# Clone the reader's root document
self.cloneReaderDocumentRoot(reader)
if not self._root:
    self._root = self._addObject(self._root_object)

# Sweep for all indirect references
externalReferenceMap = {}
self.stack = []
newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

# Delete the stack to reset
del self.stack

# Clean-Up Time!!!
# Get the new root of the PDF
realRoot = self.getObject(newRootRef)

# Get the new pages tree root and its ID Number
tmpPages = realRoot[NameObject("/Pages")]
newIdNumForPages = 1 + self._objects.index(tmpPages)

# Make an IndirectObject just for the new Pages
self._pages = IndirectObject(newIdNumForPages, 0, self)

# If there are any pages to add back in
if mustAddTogether:
    # Set the new page's root's parent to the old
    # page's root's reference
    tmpPages[NameObject("/Parent")] = oldPagesRef

    # Add the reference to the new page's root in
    # the old page's kids array
    newPagesRef = self._pages
    oldPages[NameObject("/Kids")].append(newPagesRef)

    # Set all references to the root of the old/new
    # page's root
    self._pages = oldPagesRef
    realRoot[NameObject("/Pages")] = oldPagesRef

    # Update the count attribute of the page's root
    oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])
else:
    # Bump up the info's reference b/c the old
    # page's tree was bumped off
    self._info = newInfoRef
I have this script:
#!/usr/bin/env python3
import boto3
import argparse
import time

ec2 = boto3.resource('ec2')
dstamp = time.strftime("_%m-%d-%Y-0")

parser = argparse.ArgumentParser(description='Create Image(AMI) from Instance tag:Name Value')
parser.add_argument('names', nargs='+', type=str.upper, help='instance name or list of names to create images from')
args = parser.parse_args()

# List Source Instances for Image/Tag Creation
for instance in ec2.instances.all():
    # Pull Name tags from source instances
    for name in instance.tags:
        if name["Key"] == 'Name':
            instancename = name["Value"]
            # Check for Name Tag Match with args
            for iname in args.names:
                if iname == instancename:
                    # Create an image if we have a match
                    image = instance.create_image(
                        Description=f"Created from Source: {instance.id} Name: {instancename}",
                        Name=instancename + dstamp,
                        NoReboot=True)
                    print('New: {} from instance.id: {} {}'.format(image.id, instance.id, instancename))
                    # ----------------------------------------------
                    # Can't copy tags from src instance - cause of auto-generated by Cloudformation Tags
                    # error I got: "Tag keys starting with 'aws:' are reserved for internal use"
                    # So we skip any tag [Key] named 'aws:'
                    # ----------------------------------------------
                    for tag in instance.tags:
                        dst_tags = []
                        if tag['Key'].startswith('aws:'):
                            print("Skip tag that starts with 'aws:' " + tag['Key'])
                        else:
                            dst_tags.append(tag)
                            print(' Tags:', dst_tags)
                            image.create_tags(Tags=dst_tags)
This is working perfectly, but the final function I am missing is to apply the tags to the underlying volume snapshots within the newly created image. Do I have to totally switch to client = boto3.client('ec2') in order to tag my volume snapshots?
To put it another way - how are people who are using images for backup tagging their volume snapshots?
I have been working with boto3 and python 3 for all of 3 weeks along with my regular duties, any help would be appreciated.
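The data handling behind this question can be sketched without touching AWS. The example below is an assumption-laden sketch: copyable_tags and snapshot_ids are hypothetical helper names, and the sample dictionaries are made up, shaped like instance.tags and an image's block device mappings. With the resource API already used in the script, the resulting IDs could then be tagged with something like ec2.Snapshot(snapshot_id).create_tags(Tags=tags) once the image's snapshots exist; that last step is not exercised here.

```python
def copyable_tags(tags):
    """Drop tags whose key uses the reserved 'aws:' prefix."""
    return [tag for tag in tags if not tag["Key"].startswith("aws:")]


def snapshot_ids(block_device_mappings):
    """Collect snapshot IDs from the EBS-backed entries of a mapping list."""
    ids = []
    for mapping in block_device_mappings:
        # Entries without an 'Ebs' key (e.g. instance store) have no snapshot.
        snapshot_id = mapping.get("Ebs", {}).get("SnapshotId")
        if snapshot_id:
            ids.append(snapshot_id)
    return ids


# Made-up sample data for illustration only.
sample_tags = [
    {"Key": "Name", "Value": "WEB01"},
    {"Key": "aws:cloudformation:stack-name", "Value": "my-stack"},
    {"Key": "Env", "Value": "prod"},
]
sample_mappings = [
    {"DeviceName": "/dev/xvda", "Ebs": {"SnapshotId": "snap-0123456789abcdef0"}},
    {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
]

print(copyable_tags(sample_tags))
print(snapshot_ids(sample_mappings))
```

Building the filtered tag list once also means the same list can be applied to the image and to each snapshot in a single call per resource.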
Consider a reStructuredText document with this skeleton:
Main Title
==========

text text text text text

Subsection
----------

text text text text text

.. my-import-from:: file1
.. my-import-from:: file2
The my-import-from directive is provided by a document-specific Sphinx extension, which is supposed to read the file provided as its argument, parse reST embedded in it, and inject the result as a section in the current input file. (Like autodoc, but for a different file format.) The code I have for that, right now, looks like this:
class MyImportFromDirective(Directive):
    required_arguments = 1

    def run(self):
        src, srcline = self.state_machine.get_source_and_line()
        doc_file = os.path.normpath(os.path.join(os.path.dirname(src),
                                                 self.arguments[0]))
        self.state.document.settings.record_dependencies.add(doc_file)
        doc_text = ViewList()
        try:
            doc_text = extract_doc_from_file(doc_file)
        except EnvironmentError as e:
            raise self.error(e.filename + ": " + e.strerror) from e

        doc_section = nodes.section()
        doc_section.document = self.state.document

        # report line numbers within the nested parse correctly
        old_reporter = self.state.memo.reporter
        self.state.memo.reporter = AutodocReporter(doc_text,
                                                   self.state.memo.reporter)
        nested_parse_with_titles(self.state, doc_text, doc_section)
        self.state.memo.reporter = old_reporter

        if len(doc_section) == 1 and isinstance(doc_section[0], nodes.section):
            doc_section = doc_section[0]

        # If there was no title, synthesize one from the name of the file.
        if len(doc_section) == 0 or not isinstance(doc_section[0], nodes.title):
            doc_title = nodes.title()
            doc_title.append(make_title_text(doc_file))
            doc_section.insert(0, doc_title)

        return [doc_section]
This works, except that the new section is injected as a child of the current section, rather than a sibling. In other words, the example document above produces a TOC tree like this:
Main Title
    Subsection
        File1
        File2
instead of the desired
Main Title
    Subsection
    File1
    File2
How do I fix this? The Docutils documentation is ... inadequate, particularly regarding control of section depth. One obvious thing I have tried is returning doc_section.children instead of [doc_section]; that completely removes File1 and File2 from the TOC tree (but does make the section headers in the body of the document appear to be for the right nesting level).
I don't think it is possible to do this by returning the section from the directive (without doing something along the lines of what Florian suggested), as it will get appended to the 'current' section. You can, however, add the section via self.state.section, as I do in the following (handling of options removed for brevity):
class FauxHeading(object):
    """
    A heading level that is not defined by a string. We need this to work with
    the mechanics of
    :py:meth:`docutils.parsers.rst.states.RSTState.check_subsection`.
    The important thing is that the length can vary, but it must be equal to
    any other instance of FauxHeading.
    """
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __eq__(self, other):
        return isinstance(other, FauxHeading)


class ParmDirective(Directive):
    required_arguments = 1
    optional_arguments = 0
    has_content = True
    option_spec = {
        'type': directives.unchanged,
        'precision': directives.nonnegative_int,
        'scale': directives.nonnegative_int,
        'length': directives.nonnegative_int}

    def run(self):
        variableName = self.arguments[0]
        lineno = self.state_machine.abs_line_number()
        secBody = None
        block_length = 0

        # added for some space
        lineBlock = nodes.line('', '', nodes.line_block())

        # parse the body of the directive
        if self.has_content and len(self.content):
            secBody = nodes.container()
            block_length += nested_parse_with_titles(
                self.state, self.content, secBody)

        # keeping track of the level seems to be required if we want to allow
        # nested content. Not sure why, but fits with the pattern in
        # :py:meth:`docutils.parsers.rst.states.RSTState.new_subsection`
        myLevel = self.state.memo.section_level
        self.state.section(
            variableName,
            '',
            FauxHeading(2 + len(self.options) + block_length),
            lineno,
            [lineBlock] if secBody is None else [lineBlock, secBody])
        self.state.memo.section_level = myLevel
        return []
I don't know how to do it directly inside your custom directive. However, you can use a custom transform to raise the File1 and File2 nodes in the tree after parsing. For example, see the transforms in the docutils.transforms.frontmatter module.
In your Sphinx extension, use the Sphinx.add_transform method to register the custom transform.
Update: You can also directly register the transform in your directive by returning one or more instances of the docutils.nodes.pending class in your node list. Make sure to call the note_pending method of the document in that case (in your directive you can get the document via self.state_machine.document).
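The tree manipulation such a transform would perform can be illustrated without docutils. This is a library-free sketch only: Section is a tiny made-up stand-in for a section node, and the imported flag and raise_imported function are hypothetical. A real transform would subclass docutils.transforms.Transform and do the equivalent work on the document tree in its apply() method.

```python
class Section:
    """Tiny stand-in for a section node (not the docutils API)."""

    def __init__(self, title, imported=False):
        self.title = title
        self.imported = imported  # hypothetical marker set by the directive
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


def raise_imported(parent):
    """Move imported grandchildren up so they directly follow their old parent."""
    promoted = []
    for child in parent.children:
        promoted.append(child)
        moved = [c for c in child.children if c.imported]
        child.children = [c for c in child.children if not c.imported]
        promoted.extend(moved)
    parent.children = promoted


def outline(node, depth=0):
    """Render the tree as indented title lines, like a TOC."""
    lines = ["  " * depth + node.title]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


main = Section("Main Title")
sub = main.add(Section("Subsection"))
sub.add(Section("File1", imported=True))
sub.add(Section("File2", imported=True))

raise_imported(main)
print("\n".join(outline(main)))
```

After the call, File1 and File2 sit at the same level as Subsection, which is the TOC shape the question asks for.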