Consider a reStructuredText document with this skeleton:
Main Title
==========
text text text text text
Subsection
----------
text text text text text
.. my-import-from:: file1
.. my-import-from:: file2
The my-import-from directive is provided by a document-specific Sphinx extension, which is supposed to read the file provided as its argument, parse reST embedded in it, and inject the result as a section in the current input file. (Like autodoc, but for a different file format.) The code I have for that, right now, looks like this:
import os

from docutils import nodes
from docutils.parsers.rst import Directive
from docutils.statemachine import ViewList
from sphinx.ext.autodoc import AutodocReporter  # moved in newer Sphinx versions
from sphinx.util.nodes import nested_parse_with_titles

class MyImportFromDirective(Directive):
    required_arguments = 1

    def run(self):
        src, srcline = self.state_machine.get_source_and_line()
        doc_file = os.path.normpath(os.path.join(os.path.dirname(src),
                                                 self.arguments[0]))
        self.state.document.settings.record_dependencies.add(doc_file)

        doc_text = ViewList()
        try:
            # extract_doc_from_file is this extension's own helper
            doc_text = extract_doc_from_file(doc_file)
        except EnvironmentError as e:
            raise self.error(e.filename + ": " + e.strerror) from e

        doc_section = nodes.section()
        doc_section.document = self.state.document

        # report line numbers within the nested parse correctly
        old_reporter = self.state.memo.reporter
        self.state.memo.reporter = AutodocReporter(doc_text,
                                                   self.state.memo.reporter)
        nested_parse_with_titles(self.state, doc_text, doc_section)
        self.state.memo.reporter = old_reporter

        if len(doc_section) == 1 and isinstance(doc_section[0], nodes.section):
            doc_section = doc_section[0]

        # If there was no title, synthesize one from the name of the file.
        if len(doc_section) == 0 or not isinstance(doc_section[0], nodes.title):
            doc_title = nodes.title()
            doc_title.append(make_title_text(doc_file))  # another local helper
            doc_section.insert(0, doc_title)

        return [doc_section]
This works, except that the new section is injected as a child of the current section, rather than a sibling. In other words, the example document above produces a TOC tree like this:
Main Title
  Subsection
    File1
    File2
instead of the desired
Main Title
  Subsection
  File1
  File2
How do I fix this? The Docutils documentation is ... inadequate, particularly regarding control of section depth. One obvious thing I have tried is returning doc_section.children instead of [doc_section]; that completely removes File1 and File2 from the TOC tree (but does make the section headers in the body of the document appear to be for the right nesting level).
I don't think it is possible to do this by returning the section from the directive (without doing something along the lines of what Florian suggested), as it will get appended to the 'current' section. You can, however, add the section via self.state.section, as I do in the following (handling of options removed for brevity):
from docutils import nodes
from docutils.parsers.rst import Directive, directives
from sphinx.util.nodes import nested_parse_with_titles

class FauxHeading(object):
    """
    A heading level that is not defined by a string. We need this to work with
    the mechanics of
    :py:meth:`docutils.parsers.rst.states.RSTState.check_subsection`.

    The important thing is that the length can vary, but it must be equal to
    any other instance of FauxHeading.
    """

    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __eq__(self, other):
        return isinstance(other, FauxHeading)


class ParmDirective(Directive):

    required_arguments = 1
    optional_arguments = 0
    has_content = True
    option_spec = {
        'type': directives.unchanged,
        'precision': directives.nonnegative_int,
        'scale': directives.nonnegative_int,
        'length': directives.nonnegative_int}

    def run(self):
        variableName = self.arguments[0]
        lineno = self.state_machine.abs_line_number()
        secBody = None
        block_length = 0

        # added for some space
        lineBlock = nodes.line('', '', nodes.line_block())

        # parse the body of the directive
        if self.has_content and len(self.content):
            secBody = nodes.container()
            block_length += nested_parse_with_titles(
                self.state, self.content, secBody)

        # keeping track of the level seems to be required if we want to allow
        # nested content. Not sure why, but fits with the pattern in
        # :py:meth:`docutils.parsers.rst.states.RSTState.new_subsection`
        myLevel = self.state.memo.section_level
        self.state.section(
            variableName,
            '',
            FauxHeading(2 + len(self.options) + block_length),
            lineno,
            [lineBlock] if secBody is None else [lineBlock, secBody])
        self.state.memo.section_level = myLevel
        return []
I don't know how to do it directly inside your custom directive. However, you can use a custom transform to raise the File1 and File2 nodes in the tree after parsing. For example, see the transforms in the docutils.transforms.frontmatter module.
In your Sphinx extension, use the Sphinx.add_transform method to register the custom transform.
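Something along these lines should work, although this is only a sketch: the transform class name, the 'imported' flag the directive would set, and the priority value are my own inventions, not part of any existing API.

from docutils import nodes
from docutils.transforms import Transform

class RaiseImportedSections(Transform):
    # Promote sections flagged by the directive so they become siblings
    # of the section they were parsed under.
    default_priority = 780  # run after parsing; the exact value is a judgment call

    def apply(self):
        for section in self.document.traverse(nodes.section):
            if section.get('imported'):  # a flag the directive would set
                parent = section.parent
                grandparent = parent.parent
                if grandparent is not None:
                    parent.remove(section)
                    grandparent.insert(grandparent.index(parent) + 1, section)

def setup(app):
    # Register the transform in the extension's setup() hook.
    app.add_transform(RaiseImportedSections)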
Update: You can also directly register the transform in your directive by returning one or more instances of the docutils.nodes.pending class in your node list. Make sure to call the note_pending method of the document in that case (in your directive you can get the document via self.state_machine.document).
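A minimal sketch of that pending-node variant, reusing the hypothetical RaiseImportedSections transform from above:

from docutils import nodes
from docutils.parsers.rst import Directive

class MyImportFromDirective(Directive):
    required_arguments = 1

    def run(self):
        doc_section = ...  # built exactly as in the question
        doc_section['imported'] = True
        # Register a pending node so the transform runs after parsing.
        pending = nodes.pending(RaiseImportedSections)
        self.state_machine.document.note_pending(pending)
        return [doc_section, pending]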
Related
I'm using serialize and deserialize right now, and when decoding the serialized textbuffer with utf-8 I get this:
GTKTEXTBUFFERCONTENTS-0001 <text_view_markup>
<tags>
<tag name="bold" priority="1">
<attr name="weight" type="gint" value="700" />
</tag>
<tag name="#efef29292929" priority="2">
<attr name="foreground-gdk" type="GdkColor" value="efef:2929:2929" />
</tag>
<tag name="underline" priority="0">
<attr name="underline" type="PangoUnderline" value="PANGO_UNDERLINE_SINGLE" />
</tag>
</tags>
<text><apply_tag name="underline">At</apply_tag> the first <apply_tag name="bold">comes</apply_tag> rock! <apply_tag name="underline">Rock</apply_tag>, <apply_tag name="bold">paper,</apply_tag> <apply_tag name="#efef29292929">scissors!</apply_tag></text>
</text_view_markup>
I'm trying to apply the tags using some HTML tags like <u></u> and <b></b>. I asked about this before, and since that question was closed as a duplicate, I'll ask differently: how can I tell where these tags end, if they all close with </apply_tag> instead of something like </apply_tag name="nameoftag">? I tried this before:
def correctTags(text):
    tags = []
    newstring = ''
    for i in range(len(text)):
        # The character 17 positions after '<' is the first character of the
        # tag name inside '<apply_tag name="...', which tells us which
        # closing tag this occurrence needs.
        if text[i] == '<' and i + 18 <= len(text):
            if text[i+17] == '#':
                tags.append('</font>')
            elif text[i+17] == 'b':
                tags.append('</b>')
            elif text[i+17] == 'u':
                tags.append('</u>')
    newstring = text.replace('<apply_tag name="#', '<font color="#').replace('<apply_tag name="bold">', '<b>').replace('<apply_tag name="underline">', '<u>')
    for j in tags:
        newstring = newstring.replace('</apply_tag>', j, 1)
    return '<text>' + newstring + '</text>'
But there is a problem with nested tags: they get closed where they shouldn't be.
I think the answer may be gtk.TextBuffer.register_serialize_format, since I believe it should serialize using the MIME type I pass to it, like HTML, and then I would know where the tags end. But I haven't found any friendly, reasonably complete example of its usage.
I found the solution for getting tags correctly out of a serialized textbuffer at Serialising Gtk TextBuffers to HTML. It isn't register_serialize_format, but as the site says, it's possible to write a serializer even though the documentation is sparse (and for that, I think, register_serialize_format is the way). Either way, the solution uses html.parser and xml.etree.ElementTree, but it's possible to use BeautifulSoup instead.
Basically, this script handles the serialized textbuffer content using an HTML parser. The hard work starts in feed, which receives byte content (the serialized textbuffer content) and returns a string (the formatted text with the HTML tags). First it finds the index of <text_view_markup>, dropping the header GTKTEXTBUFFERCONTENTS-0001. That header is what can't be decoded with decode('utf-8'), as it results in "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position : invalid start byte"; you could use decode('utf-8', errors='ignore') or errors='replace' for that, but since feed drops this part, the content is decoded with a plain .decode().
Then the tags and the text are handled separately. The tags are handled first; here I used xml.etree.ElementTree, but it's possible to use BeautifulSoup as the original script does. After the tags are handled, feed is called with the text; this feed is the method of HTMLParser.
As for the tags, it's possible to handle more than italics, bold, and color: you just need to update the tag2html dictionary.
Besides not using BeautifulSoup, I made some other changes. All my tags have names, so they don't fall back to an id, and my color tag already carries hex values, so I didn't need the pango_to_html_hex method. Here is how it looks right now:
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple
from xml.etree.ElementTree import fromstring

from gi import require_version
require_version('Pango', '1.0')
from gi.repository import Pango


class PangoToHtml(HTMLParser):
    """Decode a subset of Pango markup and serialize it as HTML.

    Only the Pango markup used within Gourmet is handled, although expanding it
    is not difficult.

    Due to the way that Pango attributes work, the HTML is not necessarily the
    simplest. For example italic tags may be closed early and reopened if other
    attributes, eg. bold, are inserted mid-way:

        <i> italic text </i><i><u>and underlined</u></i>

    This means that the HTML resulting from the conversion by this object may
    differ from the original that was fed to the caller.
    """

    def __init__(self):
        super().__init__()
        self.markup_text: str = ""  # the resulting content
        self.current_opening_tags: str = ""  # used during parsing
        self.current_closing_tags: List = []  # used during parsing

        # The key is the Pango id of a tag, and the value is a tuple of opening
        # and closing html tags for this id.
        self.tags: Dict[str, Tuple[str, str]] = {}

    tag2html: Dict[str, Tuple[str, str]] = {
        Pango.Style.ITALIC.value_name: ("<i>", "</i>"),  # Pango doesn't do <em>
        str(Pango.Weight.BOLD.real): ("<b>", "</b>"),
        Pango.Underline.SINGLE.value_name: ("<u>", "</u>"),
        "foreground-gdk": (r'<span foreground="{}">', "</span>"),
        "background-gdk": (r'<span background="{}">', "</span>")
    }

    def feed(self, data: bytes) -> str:
        """Convert a buffer (text and tags) to an HTML string.

        Unlike an HTMLParser, the whole string must be passed at once, chunks
        are not supported.
        """
        # Remove the Pango header: it contains a length mark, which we don't
        # care about, but which does not necessarily decode as a valid char.
        header_end = data.find(b"<text_view_markup>")
        data = data[header_end:].decode()

        # Get the tags
        tags_begin = data.index("<tags>")
        tags_end = data.index("</tags>") + len("</tags>")
        tags = data[tags_begin:tags_end]
        data = data[tags_end:]

        # Get the textual content
        text_begin = data.index("<text>")
        text_end = data.index("</text>") + len("</text>")
        text = data[text_begin:text_end]

        # Convert the tags to html.
        # We know that only a subset of HTML is handled in Gourmet:
        # italics, bold, underlined and normal
        root = fromstring(tags)
        tags_name = list(root.iter('tag'))
        tags_attributes = list(root.iter('attr'))
        tags = [[tag_name, tag_attribute]
                for tag_name, tag_attribute in zip(tags_name, tags_attributes)]

        tags_list = {}
        for tag in tags:
            opening_tags = ""
            closing_tags = ""

            tag_name = tag[0].attrib['name']
            vtype = tag[1].attrib['type']
            value = tag[1].attrib['value']
            name = tag[1].attrib['name']

            if vtype == "GdkColor":  # Convert colours to html
                if name in ['foreground-gdk', 'background-gdk']:
                    opening, closing = self.tag2html[name]
                    # hex color already handled by gtk.gdk.color.to_string()
                    hex_color = value.replace(":", "")
                    opening = opening.format(hex_color)
                else:
                    continue  # no idea!
            else:
                opening, closing = self.tag2html[value]

            opening_tags += opening
            closing_tags = closing + closing_tags  # closing tags are FILO

            tags_list[tag_name] = opening_tags, closing_tags

        self.tags = tags_list

        # Create a single output string that will be sequentially appended to
        # during feeding of text. It can then be returned once we've parsed it all.
        self.markup_text = ""
        self.current_opening_tags = ""
        self.current_closing_tags = []  # Closing tags are FILO

        super().feed(text)

        return self.markup_text

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]) -> None:
        # The pango tags are either "apply_tag", or "text". We only really care
        # about the "apply_tag". There could be an assert, but we let the
        # parser quietly handle nonsense.
        if tag == "apply_tag":
            attrs = dict(attrs)
            tag_name = attrs.get('name')
            tags = self.tags.get(tag_name)
            if tags is not None:
                (self.current_opening_tags, closing_tag) = tags
                self.current_closing_tags.append(closing_tag)

    def handle_data(self, data: str) -> None:
        data = self.current_opening_tags + data
        self.markup_text += data

    def handle_endtag(self, tag: str) -> None:
        if self.current_closing_tags:  # Can be empty due to closing "text" tag
            self.markup_text += self.current_closing_tags.pop()
        self.current_opening_tags = ""
Also a big thanks to Cyril Danilevski, who wrote this; all credit to him. As he explained, "There is also <text> and </text>, that mark the beginning and end of a TextBuffer's content." So if you follow along with the example from the site, its handle_endtag has self.markup_text += self.current_closing_tags.pop(), which will try to pop from an empty list. I recommend anyone who wants to handle tags also look at pango_html.py, which handles this by checking that the list is not empty (that check is also in the code in this answer, in handle_endtag); there's also a test file, test_pango_html.py.
Example usage:
from pango_to_html import PangoToHtml  # module name depends on the file holding the class

start_iter = text_buffer.get_start_iter()
end_iter = text_buffer.get_end_iter()
format = text_buffer.register_serialize_tagset()
exported = text_buffer.serialize(text_buffer,
                                 format,
                                 start_iter,
                                 end_iter)

p = PangoToHtml()
p.feed(exported)
Is it possible in python to pretty print the root's attributes?
I used etree to extend the attributes of the child tag, and then I overwrote the existing file with the new content. However, during the first generation of the XML we used a template in which the attributes of the root tag were listed one per line, and with etree I haven't managed to achieve the same result.
I found similar questions, but they all referred to the etree tutorial, which I find incomplete.
Hopefully someone has found a solution for this using etree.
EDIT: This is for custom XML so HTML Tidy (which was proposed in the comments), doesn't work for this.
Thanks!
generated_descriptors = list_generated_files(generated_descriptors_folder)
counter = 0

for g in generated_descriptors:
    if counter % 20 == 0:
        print "Extending Descriptor # %s out of %s" % (counter, len(descriptor_attributes))
    with open(generated_descriptors_folder + "\\" + g, 'r+b') as descriptor:
        root = etree.XML(descriptor.read(), parser=parser)

        # Go through every ContextObject to check if the block is mandatory
        for context_object in root.findall('ContextObject'):
            for attribs in descriptor_attributes:
                if attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'false')
                elif attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] not in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'true')

        # Sort the ContextObjects based on allow-null and their name
        context_objects = root.findall('ContextObject')
        context_objects_sorted = sorted(context_objects,
                                        key=lambda c: (c.attrib['allow-null'], c.attrib['name']))
        root[:] = context_objects_sorted

        # Remove mandatoryobjects from Descriptor attributes and pretty print
        root.attrib.pop("mandatoryobjects", None)
        # paste new line here

        # Convert to string in order to write the enhanced descriptor
        xml = etree.tostring(root, pretty_print=True, encoding="UTF-8", xml_declaration=True)

        # Write the enhanced descriptor
        descriptor.seek(0)      # Set cursor at beginning of the file
        descriptor.truncate(0)  # Make sure that file is empty
        descriptor.write(xml)
        descriptor.close()
    counter += 1
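As far as I know, lxml has no built-in option for one-attribute-per-line output, so post-processing the serialized bytes seems necessary. Here is a minimal sketch under that assumption; split_root_attributes and the four-space indent are made up for illustration, not part of lxml:

from lxml import etree

def split_root_attributes(xml_bytes, root_tag):
    """Rewrite the root element's opening tag so each attribute sits on its own line."""
    text = xml_bytes.decode("UTF-8")
    open_start = text.index("<" + root_tag)
    open_end = text.index(">", open_start)
    opening = text[open_start:open_end]
    # Each 'value" name=' boundary inside the opening tag gets a newline.
    opening = opening.replace('" ', '"\n    ')
    return (text[:open_start] + opening + text[open_end:]).encode("UTF-8")

xml = etree.tostring(root, pretty_print=True, encoding="UTF-8", xml_declaration=True)
xml = split_root_attributes(xml, root.tag)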
In python-docx, the paragraph object has a method insert_paragraph_before that allows inserting text before itself:
p.insert_paragraph_before("This is a text")
There is no insert_paragraph_after method, but I suppose that a paragraph object knows sufficiently about itself to determine which paragraph is next in the list. Unfortunately, the inner workings of the python-docx AST are a little intricate (and not really documented).
I wonder how to program a function with the following spec?
def insert_paragraph_after(para, text):
Trying to make sense of the inner workings of docx made me dizzy, but fortunately it's easy enough to accomplish what you want, since the internal object already has the necessary method addnext, which is all we need:
from docx import Document
from docx.text.paragraph import Paragraph
from docx.oxml.xmlchemy import OxmlElement


def insert_paragraph_after(paragraph, text=None, style=None):
    """Insert a new paragraph after the given paragraph."""
    new_p = OxmlElement("w:p")
    paragraph._p.addnext(new_p)
    new_para = Paragraph(new_p, paragraph._parent)
    if text:
        new_para.add_run(text)
    if style is not None:
        new_para.style = style
    return new_para


def main():
    # Create a minimal document
    document = Document()
    p1 = document.add_paragraph("First Paragraph.")
    p2 = document.add_paragraph("Second Paragraph.")

    # Insert a paragraph wedged between p1 and p2
    insert_paragraph_after(p1, "Paragraph One And A Half.")

    # Test if the function succeeded
    document.save(r"D:\somepath\docx_para_after.docx")


if __name__ == "__main__":
    main()
Please refer to the details below:
para1 = document.add_paragraph("Hello World")
para2 = document.add_paragraph("Testing!!")
p1 = para1._p
p1.addnext(para2._p)
Reference
In the meantime I found another method, more high-level (though perhaps not as elegant). It essentially finds the parent, lists the children, works out its own position in line, and then gets the next one.

def par_index(paragraph):
    "Get the index of the paragraph in the document"
    doc = paragraph._parent
    # the paragraph elements are being generated on the fly,
    # they change all the time,
    # so in order to index, we must use the elements
    l_elements = [p._element for p in doc.paragraphs]
    return l_elements.index(paragraph._element)


def insert_paragraph_after(paragraph, text, style=None):
    """
    Add a paragraph to a docx document, after this one.
    """
    doc = paragraph._parent
    i = par_index(paragraph) + 1  # next
    if i < len(doc.paragraphs):
        # we find the next paragraph and we insert before:
        next_par = doc.paragraphs[i]
        new_par = next_par.insert_paragraph_before(text, style)
    else:
        # we reached the end, so we need to create a new one:
        new_par = doc.add_paragraph(text, style)
    return new_par
One advantage is that it mostly avoids getting into the inner workings.
I have existing code that processes the output from suds.client.Client(...).service.GetFoo(). Now that part of the flow has changed and we are no longer using SOAP, instead receiving the same XML through other channels. I would like to re-use the existing code by using the suds Typed unmarshaller, but so far have not been successful.
I came 90% of the way using the Basic unmarshaller:
tree = suds.umx.basic.Basic().process(xmlroot)
This gives me the nice tree of objects with attributes, so that the pre-existing code can access tree[some_index].someAttribute, but the value will of course always be a string rather than an integer or date or whatever, so the code still can't be re-used as-is.
The original class:
from suds.client import Client

class SomeService(object):
    def __init__(self):
        self.soap_client = Client(some_wsdl_url)

    def GetStuff(self):
        return self.soap_client.service.GetStuff()
The drop-in replacement that almost works:
import urllib2

import suds.sax.parser
import suds.umx.basic

class SomeSourceUntyped(object):
    def __init__(self):
        self.url = some_url

    def GetStuff(self):
        xmlfile = urllib2.urlopen(self.url)
        xmlroot = suds.sax.parser.Parser().parse(xmlfile)
        if xmlroot:
            # because the parser creates a document root above the document root
            tree = suds.umx.basic.Basic().process(xmlroot)[0]
        else:
            tree = None
        return tree
My vain effort to understand suds.umx.typed.Typed():
import os
import urllib2

import suds.options
import suds.sax.parser
import suds.umx.typed
import suds.xsd.query
import suds.xsd.schema

class SomeSourceTyped(object):
    def __init__(self):
        self.url = some_url

        self.schema_file_name = os.path.realpath(
            os.path.join(os.path.dirname(__file__), 'schema.xsd'))
        with open(self.schema_file_name) as f:
            self.schema_node = suds.sax.parser.Parser().parse(f)
        self.schema = suds.xsd.schema.Schema(self.schema_node, "", suds.options.Options())
        self.schema_query = suds.xsd.query.ElementQuery(('http://example.com/namespace/', 'Stuff'))
        self.xmltype = self.schema_query.execute(self.schema)

    def GetStuff(self):
        xmlfile = urllib2.urlopen(self.url)
        xmlroot = suds.sax.parser.Parser().parse(xmlfile)
        if xmlroot:
            unmarshaller = suds.umx.typed.Typed(self.schema)
            # I'm still running into an exception, so obviously something is missing:
            # " Exception: (document, None, ), must be qref "
            # Do I need to call the Parser differently?
            tree = unmarshaller.process(xmlroot, self.xmltype)[0]
        else:
            tree = None
        return tree
This is an obscure one.
Bonus caveat: Of course I am in a legacy system that uses suds 0.3.9.
EDIT: further evolution on the code, found how to create SchemaObjects.
I'm currently using xml.dom.minidom to parse some XML in python. After parsing, I'm doing some reporting on the content, and would like to report the line (and column) where the tag started in the source XML document, but I don't see how that's possible.
I'd like to stick with xml.dom / xml.dom.minidom if possible, but if I need to use a SAX parser to get the origin info, I can do that -- ideal in that case would be using SAX to track node location, but still end up with a DOM for my post-processing.
Any suggestions on how to do this? Hopefully I'm just overlooking something in the docs and this is extremely easy.
By monkeypatching the minidom content handler I was able to record line and column number for each node (as the 'parse_position' attribute). It's a little dirty, but I couldn't see any "officially sanctioned" way of doing it :) Here's my test script:
from xml.dom import minidom
import xml.sax

doc = """\
<File>
  <name>Name</name>
  <pos>./</pos>
</File>
"""

def set_content_handler(dom_handler):
    def startElementNS(name, tagName, attrs):
        orig_start_cb(name, tagName, attrs)
        cur_elem = dom_handler.elementStack[-1]
        cur_elem.parse_position = (
            parser._parser.CurrentLineNumber,
            parser._parser.CurrentColumnNumber
        )

    orig_start_cb = dom_handler.startElementNS
    dom_handler.startElementNS = startElementNS
    orig_set_content_handler(dom_handler)

parser = xml.sax.make_parser()
orig_set_content_handler = parser.setContentHandler
parser.setContentHandler = set_content_handler

dom = minidom.parseString(doc, parser)

pos = dom.firstChild.parse_position
print("Parent: '{0}' at {1}:{2}".format(
    dom.firstChild.localName, pos[0], pos[1]))

for child in dom.firstChild.childNodes:
    if child.localName is None:
        continue
    pos = child.parse_position
    print("Child: '{0}' at {1}:{2}".format(child.localName, pos[0], pos[1]))
It outputs the following:
Parent: 'File' at 1:0
Child: 'name' at 2:2
Child: 'pos' at 3:2
A different way to hack around the problem is by patching line number information into the document before parsing it. Here's the idea:
import re
from xml.dom import minidom

LINE_DUMMY_ATTR = '_DUMMY_LINE'  # Make sure this string is unique!

def parseXml(filename):
    f = open(filename, 'r')
    l = 0
    content = list()
    for line in f:
        l += 1
        # Tack the current line number onto every opening tag.
        content.append(re.sub(r'<(\w+)', r'<\1 ' + LINE_DUMMY_ATTR + '="' + str(l) + '"', line))
    f.close()
    return minidom.parseString("".join(content))
Then you can retrieve the line number of an element with
int(element.getAttribute(LINE_DUMMY_ATTR))
Quite clearly, this approach has its own set of drawbacks, and if you really need column numbers too, patching those in will be somewhat more involved. Also, if you want to extract text nodes or comments or use Node.toxml(), you'll have to make sure to strip LINE_DUMMY_ATTR out of any accidental matches there.
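For instance, that clean-up could be a single substitution; a minimal sketch (strip_dummy_attr is made up here, not part of the approach above):

import re

def strip_dummy_attr(xml_text):
    # Drop the synthetic line-number attribute before emitting XML.
    return re.sub(r'\s+' + LINE_DUMMY_ATTR + r'="\d+"', '', xml_text)

# e.g.: print(strip_dummy_attr(element.toxml()))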
The one advantage of this solution over aknuds1's answer is that it does not require messing with minidom internals.