The library's click_action feature is only supported for BaseShape subclasses (which a table cell, its text frame, its paragraphs and its runs are not). Meanwhile, a text run's hyperlink attribute only supports external (web) links. How can I add an internal link to a slide from within a table, for example to create a table of contents?
The following code solves the question above:
from lxml import etree  # type: ignore
from pptx.opc.constants import RELATIONSHIP_TYPE  # type: ignore


def link_table_cell_to_slide(table_shape, cell, slide):
    # pylint: disable=protected-access
    rel_id = table_shape._parent.part.relate_to(slide.part, RELATIONSHIP_TYPE.SLIDE)
    link_run = cell.text_frame.paragraphs[0].runs[0]
    nsmap = link_run._r.nsmap
    # trying to do it with format strings and escaping curly braces
    # was not doable with pylint (unignorable syntax error)
    ns_a = '{' + nsmap['a'] + '}'
    ns_r = '{' + nsmap['r'] + '}'
    run_properties = link_run._r.find(f'{ns_a}rPr')
    hlink = etree.SubElement(run_properties, f'{ns_a}hlinkClick')
    hlink.set('action', 'ppaction://hlinksldjump')
    hlink.set(f'{ns_r}id', rel_id)


# ... determining the toc_shape, row in the table to link up as well as
# the target slide to link to
link_table_cell_to_slide(toc_shape, row.cells[0], target_slide)
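The XML side of this can be tried without a presentation at hand. The following is only a sketch using the standard library's ElementTree instead of lxml; the helper name add_slide_jump and the relationship id "rId7" are made up for illustration. It also covers the case where the run has no a:rPr element yet, which the function above assumes is present.

```python
import xml.etree.ElementTree as ET

NS_A = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS_R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def add_slide_jump(run, rel_id):
    """Attach an a:hlinkClick slide-jump element to a run's a:rPr."""
    r_pr = run.find(f"{{{NS_A}}}rPr")
    if r_pr is None:
        # a:rPr has to come before a:t, so insert it as the first child
        r_pr = ET.Element(f"{{{NS_A}}}rPr")
        run.insert(0, r_pr)
    hlink = ET.SubElement(r_pr, f"{{{NS_A}}}hlinkClick")
    hlink.set("action", "ppaction://hlinksldjump")
    hlink.set(f"{{{NS_R}}}id", rel_id)
    return hlink


# A bare run with text only, as it might appear inside a table cell
run = ET.fromstring(f'<a:r xmlns:a="{NS_A}"><a:t>Chapter 1</a:t></a:r>')
link = add_slide_jump(run, "rId7")  # "rId7" is a made-up relationship id
```

In the real function the relationship id comes from relate_to(), so only the element handling carries over.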
I've built a ctypes interface to libxml2. The Python definition of xmlDoc is:
class xmlDoc(ctypes.Structure):
    _fields_ = [
        ("_private", ctypes.c_void_p),    # application data
        ("type", ctypes.c_uint16),        # XML_DOCUMENT_NODE, must be second !
        ("name", ctypes.c_char_p),        # name/filename/URI of the document
        ("children", ctypes.c_void_p),    # the document tree
        ("last", ctypes.c_void_p),        # last child link
        ("parent", ctypes.c_void_p),      # child->parent link
        ("next", ctypes.c_void_p),        # next sibling link
        ("prev", ctypes.c_void_p),        # previous sibling link
        ("doc", ctypes.c_void_p),         # autoreference to itself End of common part
        ("compression", ctypes.c_int),    # level of zlib compression
        ("standalone", ctypes.c_int),     # standalone document (no external refs) 1 if standalone="yes" 0 if sta
        ("intSubset", ctypes.c_void_p),   # the document internal subset
        ("extSubset", ctypes.c_void_p),   # the document external subset
        ("oldNs", ctypes.c_void_p),       # Global namespace, the old way
        ("version", ctypes.c_char_p),     # the XML version string
        ("encoding", ctypes.c_char_p),    # external initial encoding, if any
        ("ids", ctypes.c_void_p),         # Hash table for ID attributes if any
        ("refs", ctypes.c_void_p),        # Hash table for IDREFs attributes if any
        ("URL", ctypes.c_char_p),         # The URI for that document
        ("charset", ctypes.c_int),        # Internal flag for charset handling, actually an xmlCharEncoding
        ("dict", ctypes.c_void_p),        # dict used to allocate names or NULL
        ("psvi", ctypes.c_void_p),        # for type/PSVI information
        ("parseFlags", ctypes.c_int),     # set of xmlParserOption used to parse the document
        ("properties", ctypes.c_int),     # set of xmlDocProperties for this document set at the end of parsing
    ]
The char* pointers all make sense, but the xmlNode* and xmlDoc* pointers don't: xmlDoc->doc should point back to the document itself (values as shown in VS Code):
The solution was in my own code, which came from the ctypes templates. Effectively, the template casts a ctypes.c_void_p to a ctypes.POINTER() of the target type, which in this case is my structure definition of xmlNode. The line of code is:
# perfect use of lambda
xmlNode = lambda x: ctypes.cast(x, ctypes.POINTER(LibXml.xmlNode))
And the fixed code:
def InsertChild(tree: QTreeWidget, item: QTreeWidgetItem, node: ctypes.c_void_p):
cur = node.contents
xmlNode = lambda x: ctypes.cast(x, ctypes.POINTER(LibXml.xmlNode))
while cur:
# if cur.content: item.setText(1, cur.content.decode('utf-8'))
item.setText(2, utils.PtrToHex(ctypes.addressof(cur)))
if cur.children:
child = QTreeWidgetItem(tree);
InsertChild(tree, child, xmlNode(cur.children))
cur = xmlNode(
item = QTreeWidgetItem(tree);
else: cur = None
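The cast itself can be tried without libxml2. Below is a minimal sketch, assuming a made-up two-field Node structure whose link is stored as c_void_p in the same way as the fields of xmlDoc above; the names Node, as_node and walk are hypothetical.

```python
import ctypes


class Node(ctypes.Structure):
    """Hypothetical stand-in for a C node struct with an untyped link."""
    _fields_ = [
        ("value", ctypes.c_int),
        ("next", ctypes.c_void_p),  # stored untyped, like the links in xmlDoc
    ]


def as_node(address):
    """Cast a raw address (int or None) to a typed pointer to Node."""
    return ctypes.cast(address, ctypes.POINTER(Node))


def walk(head):
    """Follow the untyped 'next' links and collect every value."""
    values = []
    cur = as_node(ctypes.addressof(head))
    while cur:  # a NULL pointer is falsy
        values.append(cur.contents.value)
        cur = as_node(cur.contents.next)
    return values


second = Node(2, None)
first = Node(1, ctypes.addressof(second))
result = walk(first)
```

Reading a c_void_p field gives back a plain int (or None for NULL), which is why the value has to be cast before .contents can be used on it.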
I'm using serialize and deserialize right now, and when decoding the serialized textbuffer with utf-8 I get this:
GTKTEXTBUFFERCONTENTS-0001 <text_view_markup>
 <tags>
  <tag name="bold" priority="1">
   <attr name="weight" type="gint" value="700" />
  </tag>
  <tag name="#efef29292929" priority="2">
   <attr name="foreground-gdk" type="GdkColor" value="efef:2929:2929" />
  </tag>
  <tag name="underline" priority="0">
   <attr name="underline" type="PangoUnderline" value="PANGO_UNDERLINE_SINGLE" />
  </tag>
 </tags>
<text><apply_tag name="underline">At</apply_tag> the first <apply_tag name="bold">comes</apply_tag> rock! <apply_tag name="underline">Rock</apply_tag>, <apply_tag name="bold">paper,</apply_tag> <apply_tag name="#efef29292929">scissors!</apply_tag></text>
</text_view_markup>
I'm trying to convert these tags to HTML tags like <u></u> and <b></b>. I asked this before and it was closed as a duplicate, so I'll ask differently: how can I tell where each tag ends if they all end with </apply_tag> rather than something like </apply_tag name="nameoftag">? I tried this before:
def correctTags(text):
    tags = []
    newstring = ''
    # Collect one closing tag per opening <apply_tag name="...">, in order
    for i in range(len(text)):
        if text[i] == '<' and i+18 <= len(text):
            if text[i+17] == '#':
                tags.append('</font>')
            elif text[i+17] == 'b':
                tags.append('</b>')
            elif text[i+17] == 'u':
                tags.append('</u>')
    newstring = text.replace('<apply_tag name="#', '<font color="#').replace('<apply_tag name="bold">', '<b>').replace('<apply_tag name="underline">', '<u>')
    # Replace each </apply_tag> with the next collected closing tag
    for j in tags:
        newstring = newstring.replace('</apply_tag>', j, 1)
    return '<text>' + newstring + '</text>'
But there is a problem with inner tags: they get closed where they shouldn't be.
I think the answer may be gtk.TextBuffer.register_serialize_format, since it should serialize using the MIME type that I pass to it (HTML, for example), and then I would know where the tags end. But I couldn't find any extensive, beginner-friendly example of its usage.
I found a way to get the tags out of the serialized text buffer correctly in the article Serialising Gtk TextBuffers to HTML. It doesn't use register_serialize_format; as the article says, it is possible to write a serializer, but the documentation is sparse (and I think that is where register_serialize_format would come in). Either way, the solution uses html.parser and xml.etree.ElementTree, though it is also possible to use BeautifulSoup.
Basically, this script handles the serialized text buffer content with an HTML parser. The hard work starts in feed, which receives bytes (the serialized text buffer content) and returns a string (the formatted text with the HTML tags). First it finds the index of <text_view_markup> and drops the header GTKTEXTBUFFERCONTENTS-0001. That header is what cannot be decoded with decode('utf-8'): it raises "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position : invalid start byte". You could use decode('utf-8', errors='ignore') or errors='replace' to get around that, but since feed drops this part, the rest is decoded with a plain .decode().
Then the tags and the text are handled separately. The tags are handled first, and here I used xml.etree.ElementTree, although it is possible to use BeautifulSoup as the original script does. After the tags are handled, feed is called with the text; this feed is the method of HTMLParser.
It is also possible to handle more than italic, bold and color tags: you just need to update the tag2html dictionary.
Besides not using BeautifulSoup, I made some other changes. All of my tags have names, so they are looked up by name rather than by id. My color tags also already hold hex values, so I didn't need the pango_to_html_hex method. Here is how it looks right now:
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple
from xml.etree.ElementTree import fromstring
from gi import require_version
require_version('Pango', '1.0')
from gi.repository import Pango
class PangoToHtml(HTMLParser):
    """Decode a subset of Pango markup and serialize it as HTML.

    Only the Pango markup used within Gourmet is handled, although expanding it
    is not difficult.

    Due to the way that Pango attributes work, the HTML is not necessarily the
    simplest. For example italic tags may be closed early and reopened if other
    attributes, eg. bold, are inserted mid-way:

        <i> italic text </i><i><u>and underlined</u></i>

    This means that the HTML resulting from the conversion by this object may
    differ from the original that was fed to the caller.
    """

    def __init__(self):
        super().__init__()
        self.markup_text: str = ""  # the resulting content
        self.current_opening_tags: str = ""  # used during parsing
        self.current_closing_tags: List = []  # used during parsing

        # The key is the Pango id of a tag, and the value is a tuple of opening
        # and closing html tags for this id.
        self.tags: Dict[str, Tuple[str, str]] = {}

    tag2html: Dict[str, Tuple[str, str]] = {
        Pango.Style.ITALIC.value_name: ("<i>", "</i>"),  # Pango doesn't do <em>
        str(Pango.Weight.BOLD.real): ("<b>", "</b>"),
        Pango.Underline.SINGLE.value_name: ("<u>", "</u>"),
        "foreground-gdk": (r'<span foreground="{}">', "</span>"),
        "background-gdk": (r'<span background="{}">', "</span>")
    }

    def feed(self, data: bytes) -> str:
        """Convert a buffer (text and the buffer's iterators) to an html string.

        Unlike an HTMLParser, the whole string must be passed at once, chunks
        are not supported.
        """
        # Remove the Pango header: it contains a length mark, which we don't
        # care about, but which does not necessarily decode as a valid char.
        header_end = data.find(b"<text_view_markup>")
        data = data[header_end:].decode()

        # Get the tags
        tags_begin = data.index("<tags>")
        tags_end = data.index("</tags>") + len("</tags>")
        tags = data[tags_begin:tags_end]
        data = data[tags_end:]

        # Get the textual content
        text_begin = data.index("<text>")
        text_end = data.index("</text>") + len("</text>")
        text = data[text_begin:text_end]

        # Convert the tags to html.
        # We know that only a subset of HTML is handled in Gourmet:
        # italics, bold, underlined and normal
        root = fromstring(tags)
        tags_name = list(root.iter('tag'))
        tags_attributes = list(root.iter('attr'))
        tags = [[tag_name, tag_attribute] for tag_name, tag_attribute in zip(tags_name, tags_attributes)]

        tags_list = {}
        for tag in tags:
            opening_tags = ""
            closing_tags = ""

            tag_name = tag[0].attrib['name']
            vtype = tag[1].attrib['type']
            value = tag[1].attrib['value']
            name = tag[1].attrib['name']

            if vtype == "GdkColor":  # Convert colours to html
                if name in ['foreground-gdk', 'background-gdk']:
                    opening, closing = self.tag2html[name]
                    # hex color already handled by gtk.gdk.color.to_string() method
                    hex_color = f'{value.replace(":","")}'
                    opening = opening.format(hex_color)
                else:
                    continue  # no idea!
            else:
                opening, closing = self.tag2html[value]

            opening_tags += opening
            closing_tags = closing + closing_tags  # closing tags are FILO

            tags_list[tag_name] = opening_tags, closing_tags

            if opening_tags:
                tags_list[tag_name] = opening_tags, closing_tags

        self.tags = tags_list

        # Create a single output string that will be sequentially appended to
        # during feeding of text. It can then be returned once we've parsed all
        # the text.
        self.markup_text = ""
        self.current_opening_tags = ""
        self.current_closing_tags = []  # Closing tags are FILO

        # Hand the text over to HTMLParser.feed, which calls the handlers below
        super().feed(text)

        return self.markup_text

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]) -> None:
        # The pango tags are either "apply_tag", or "text". We only really care
        # about the "apply_tag". There could be an assert, but we let the
        # parser quietly handle nonsense.
        if tag == "apply_tag":
            attrs = dict(attrs)
            tag_name = attrs.get('name')
            tags = self.tags.get(tag_name)

            if tags is not None:
                (self.current_opening_tags, closing_tag) = tags
                self.current_closing_tags.append(closing_tag)

    def handle_data(self, data: str) -> None:
        data = self.current_opening_tags + data
        self.markup_text += data

    def handle_endtag(self, tag: str) -> None:
        if self.current_closing_tags:  # Can be empty due to closing "text" tag
            self.markup_text += self.current_closing_tags.pop()
        self.current_opening_tags = ""
Also, a big thanks to Cyril Danilevski, who wrote this; all credit goes to him. As he explained, "There is also <text> and </text>, that mark the beginning and end of a TextBuffer's content." So if you follow along with the example from the site, its handle_endtag runs self.markup_text += self.current_closing_tags.pop() unconditionally, which will try to pop from an empty list. I recommend that anyone who wants to handle tags use a version that checks that the list is not empty first (as handle_endtag in this answer does). There is also a test file.
Example of usage
from PangoToHtml import PangoToHtml

start_iter = text_buffer.get_start_iter()
end_iter = text_buffer.get_end_iter()
format = text_buffer.register_serialize_tagset()
exported = text_buffer.serialize(text_buffer, format, start_iter, end_iter)

p = PangoToHtml()
p.feed(exported)
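The reason a parser answers the original question (which </apply_tag> closes which tag) is that closing tags can be kept on a stack: every opening tag pushes its HTML closer, and every </apply_tag> pops the most recent one. Here is a stripped-down sketch of just that idea, without GTK or Pango. The TAG_TO_HTML mapping, the class name and the nested sample string are made up for illustration and are not taken from a real buffer.

```python
from html.parser import HTMLParser

# Hypothetical mapping from tag names to HTML pairs (an assumption for this sketch)
TAG_TO_HTML = {
    "bold": ("<b>", "</b>"),
    "underline": ("<u>", "</u>"),
}


class ApplyTagToHtml(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []      # pieces of the resulting HTML
        self.closers = []  # stack of pending closing tags

    def handle_starttag(self, tag, attrs):
        if tag != "apply_tag":
            return
        opening, closing = TAG_TO_HTML.get(dict(attrs).get("name"), ("", ""))
        self.out.append(opening)
        self.closers.append(closing)

    def handle_data(self, data):
        self.out.append(data)

    def handle_endtag(self, tag):
        # </text> also ends up here, so only pop for apply_tag
        if tag == "apply_tag" and self.closers:
            self.out.append(self.closers.pop())


def convert(markup):
    parser = ApplyTagToHtml()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


sample = ('<text><apply_tag name="underline">At '
          '<apply_tag name="bold">first</apply_tag></apply_tag> rock</text>')
html = convert(sample)
```

With the nested sample above, the inner </apply_tag> becomes </b> and the outer one becomes </u>, which is exactly what the string-replace attempt in the question could not guarantee.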
How can a docx table be indented? I am trying to line a table up with a tab stop set at 2cm. The following script creates a header, some text and a table:
import docx
from docx.shared import Cm

doc = docx.Document()
style = doc.styles['Normal']
doc.add_paragraph('My header', style='Heading 1')
doc.add_paragraph('\tText is tabbed')

# This indents the paragraph inside, not the table
# style = doc.styles['Table Grid']
# style.paragraph_format.left_indent = Cm(2)

table = doc.add_table(rows=0, cols=2, style="Table Grid")

for rowy in range(1, 5):
    row_cells = table.add_row().cells
    row_cells[0].text = 'Row {}'.format(rowy)
    row_cells[0].width = Cm(5)
    row_cells[1].text = ''
    row_cells[1].width = Cm(1.2)

doc.save('output.docx')
It produces a table with no indent as follows:
How can the table be indented as follows?
(preferably without having to load an existing document):
If for example left-indent is added to the Table Grid style (by uncommenting the lines), it will be applied at the paragraph level, not the table level resulting in the following (which is not wanted):
In Microsoft Word, this can be done on the table properties by entering 2.0 cm for Indent from left.
Based on Fred C's answer, I came up with this solution:
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
def indent_table(table, indent):
    # noinspection PyProtectedMember
    tbl_pr = table._element.xpath('w:tblPr')
    if tbl_pr:
        e = OxmlElement('w:tblInd')
        e.set(qn('w:w'), str(indent))
        e.set(qn('w:type'), 'dxa')
        # attach the new element to the table properties
        tbl_pr[0].append(e)
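One detail worth spelling out: with type 'dxa' the w:w value is measured in twips (twentieths of a point, 1440 per inch), not in centimetres and not in the EMU value that Cm(2) evaluates to. The following sketch only illustrates the conversion and the resulting element, using the standard library rather than python-docx; the helper names cm_to_twips and make_tbl_ind are made up.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def cm_to_twips(cm):
    """Convert centimetres to twips (1 inch = 2.54 cm = 1440 twips)."""
    return round(cm / 2.54 * 1440)


def make_tbl_ind(indent_twips):
    """Build a w:tblInd element carrying the indent in dxa (twips)."""
    elem = ET.Element(f"{{{W_NS}}}tblInd")
    elem.set(f"{{{W_NS}}}w", str(indent_twips))
    elem.set(f"{{{W_NS}}}type", "dxa")
    return elem


# 2 cm, as in the question: 2 / 2.54 * 1440 is about 1133.9, rounded to 1134
tbl_ind = make_tbl_ind(cm_to_twips(2))
```

With python-docx itself, the Length objects returned by Cm() expose a twips property, so a call along the lines of indent_table(table, Cm(2).twips) should give the same value.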
This feature is not yet supported by python-docx. It looks like this behavior is produced by the w:tblInd child of the w:tbl element. It's possible you could develop a workaround function to add an element like this using lxml calls on the w:tbl element, which should be available on the ._element attribute of a Table object.
You can find examples of other workaround functions by searching on 'python-docx workaround function' and similar ones by searching on 'python-pptx workaround functions'.
Here's how I did it:
import docx
import lxml
mydoc = docx.Document()
mytab = mydoc.add_table(3, 3)
nsmap=mytab._element[0].nsmap # For namespaces
searchtag='{%s}tblPr' % nsmap['w'] # w:tblPr
mytag='{%s}tblInd' % nsmap['w'] # w:tblInd
myw='{%s}w' % nsmap['w'] # w:w
mytype='{%s}type' % nsmap['w'] # w:type
for elt in mytab._element:
if elt.tag == searchtag:
Consider a reStructuredText document with this skeleton:
Main Title
==========

text text text text text
text text text text text

.. my-import-from:: file1
.. my-import-from:: file2
The my-import-from directive is provided by a document-specific Sphinx extension, which is supposed to read the file provided as its argument, parse reST embedded in it, and inject the result as a section in the current input file. (Like autodoc, but for a different file format.) The code I have for that, right now, looks like this:
class MyImportFromDirective(Directive):
required_arguments = 1
def run(self):
src, srcline = self.state_machine.get_source_and_line()
doc_file = os.path.normpath(os.path.join(os.path.dirname(src),
doc_text = ViewList()
doc_text = extract_doc_from_file(doc_file)
except EnvironmentError as e:
raise self.error(e.filename + ": " + e.strerror) from e
doc_section = nodes.section()
doc_section.document = self.state.document
# report line numbers within the nested parse correctly
old_reporter = self.state.memo.reporter
self.state.memo.reporter = AutodocReporter(doc_text,
nested_parse_with_titles(self.state, doc_text, doc_section)
self.state.memo.reporter = old_reporter
if len(doc_section) == 1 and isinstance(doc_section[0], nodes.section):
doc_section = doc_section[0]
# If there was no title, synthesize one from the name of the file.
if len(doc_section) == 0 or not isinstance(doc_section[0], nodes.title):
doc_title = nodes.title()
doc_section.insert(0, doc_title)
return [doc_section]
This works, except that the new section is injected as a child of the current section, rather than a sibling. In other words, the example document above produces a TOC tree like this:
Main Title
instead of the desired
Main Title
How do I fix this? The Docutils documentation is ... inadequate, particularly regarding control of section depth. One obvious thing I have tried is returning doc_section.children instead of [doc_section]; that completely removes File1 and File2 from the TOC tree (but does make the section headers in the body of the document appear to be for the right nesting level).
I don't think it is possible to do this by returning the section from the directive (without doing something along the lines of what Florian suggested), as it will get appended to the 'current' section. You can, however, add the section via self.state.section as I do in the following (handling of options removed for brevity)
class FauxHeading(object):
    """
    A heading level that is not defined by a string. We need this to work with
    the mechanics of

    The important thing is that the length can vary, but it must be equal to
    any other instance of FauxHeading.
    """

    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __eq__(self, other):
        return isinstance(other, FauxHeading)
class ParmDirective(Directive):
required_arguments = 1
optional_arguments = 0
has_content = True
option_spec = {
'type': directives.unchanged,
'precision': directives.nonnegative_int,
'scale': directives.nonnegative_int,
'length': directives.nonnegative_int}
def run(self):
variableName = self.arguments[0]
lineno = self.state_machine.abs_line_number()
secBody = None
block_length = 0
# added for some space
lineBlock = nodes.line('', '', nodes.line_block())
# parse the body of the directive
if self.has_content and len(self.content):
secBody = nodes.container()
block_length += nested_parse_with_titles(
self.state, self.content, secBody)
# keeping track of the level seems to be required if we want to allow
# nested content. Not sure why, but fits with the pattern in
# :py:meth:`docutils.parsers.rst.states.RSTState.new_subsection`
myLevel = self.state.memo.section_level
FauxHeading(2 + len(self.options) + block_length),
[lineBlock] if secBody is None else [lineBlock, secBody])
self.state.memo.section_level = myLevel
return []
I don't know how to do it directly inside your custom directive. However, you can use a custom transform to raise the File1 and File2 nodes in the tree after parsing. For example, see the transforms in the docutils.transforms.frontmatter module.
In your Sphinx extension, use the Sphinx.add_transform method to register the custom transform.
Update: You can also directly register the transform in your directive by returning one or more instances of the docutils.nodes.pending class in your node list. Make sure to call the note_pending method of the document in that case (in your directive you can get the document via self.state_machine.document).
OK, I'll be the first to admit it is, just not the path I want, and I don't know how to get it.
I'm using Python 3.3 in Eclipse with the PyDev plugin, on both Windows 7 at work and Ubuntu 13.04 at home. I'm new to Python and have limited programming experience.
I'm trying to write a script that takes in an XML Lloyd's market insurance message, finds all the tags and dumps them into a .csv file, where we can easily update them and then reimport them to create an updated XML.
I have managed to do all of that, except that when I collect the tags I only get each tag's own name and not the tags above it.
<TechAccount Sender="broker" Receiver="insurer">
That is a fragment of the XML. What I want is to find all the tags and their paths. For example, for Count I want to show ItemsInGroupTotal/Count, but I can only get Count.
Here is my code:
xml = etree.parse(fullpath)
print(xml.xpath('.//*'))
all_xpath = xml.xpath('.//*')
every_tag = []
for i in all_xpath:
    single_tag = '%s,%s' % (i.tag, i.text)
    every_tag.append(single_tag)
This gives:
'{}ServiceProviderGroupReference,8-2012-08-10', '{}ServiceProviderGroupItemsTotal,\n', '{}Count,13',
As you can see Count is shown as {namespace}Count, 13 and not {namespace}ItemsInGroupTotal/Count, 13
Can anyone point me towards what I need?
Thanks (hope my first post is OK)
This is my code now:
with open(fullpath, 'rb') as xmlFilepath:
    xmlfile = xmlFilepath.read()
fulltext = '%s' % xmlfile
text = fulltext[2:]
xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
But this returns an error:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
I removed the first two characters because they are b' and the parser complained that the input didn't start with a tag.
I have been playing around with this and if I remove the xis: xxx tags and the namespace stuff at the top it works as expected. I need to keep the xis tags and be able to identify them as xis tags so can't just delete them.
Any help on how I can achieve this?
ElementTree objects have a method getpath(element), which returns a
structural, absolute XPath expression to find that element
Calling getpath on each element in an iter() loop should work for you:
from pprint import pprint
from lxml import etree
text = """
<TechAccount Sender="broker" Receiver="insurer">
xml = etree.fromstring(text)
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
['/TechAccount, \n',
'/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
'/TechAccount/BrokerReference, HOY123/456',
'/TechAccount/ServiceProviderReference, 2012080921401A1',
'/TechAccount/CreationDate, 2012-08-10',
'/TechAccount/AccountTransactionType, premium',
'/TechAccount/GroupReference, 2012080921401A1',
'/TechAccount/ItemsInGroupTotal, \n',
'/TechAccount/ItemsInGroupTotal/Count, 1',
'/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
'/TechAccount/ServiceProviderGroupItemsTotal, \n',
'/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
If your xml data is in the file test.xml, the code would look like:
from pprint import pprint
from lxml import etree

xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)
every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
Hope that helps.
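On the ValueError from the question: the message itself points at the fix, which is to hand the parser the bytes read from the file instead of turning them into a str first. For comparison, here is a rough sketch of the same path listing using only the standard library, which has no getpath(), so the path is built while walking the tree. The function name iter_paths is made up, and the sample is a trimmed-down version of the message with an XML declaration added.

```python
import xml.etree.ElementTree as ET


def iter_paths(element, prefix=""):
    """Yield (path, text) for an element and all its descendants, depth first."""
    path = f"{prefix}/{element.tag}"
    yield path, (element.text or "").strip()
    for child in element:
        yield from iter_paths(child, path)


# Bytes input, so the encoding declaration is accepted by the parser
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<TechAccount Sender="broker" Receiver="insurer">
  <ItemsInGroupTotal><Count>1</Count></ItemsInGroupTotal>
  <ServiceProviderGroupItemsTotal><Count>13</Count></ServiceProviderGroupItemsTotal>
</TechAccount>"""

rows = list(iter_paths(ET.fromstring(sample)))
```

Unlike getpath(), this does not add positional indexes such as [2] for repeated siblings, and namespaced tags show up in {uri}name form.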
getpath() does indeed return an xpath that's not suited for human consumption. From this xpath, you can build up a more useful one though. Such as with this quick-and-dirty approach:
def human_xpath(element):
    full_xpath = element.getroottree().getpath(element)
    xpath = ''
    human_xpath = ''
    for i, node in enumerate(full_xpath.split('/')[1:]):
        xpath += '/' + node
        element = element.xpath(xpath)[0]
        # split '{namespace}tag' into its two parts
        namespace, tag = element.tag[1:].split('}', 1)
        if element.getparent() is not None:
            nsmap = {'ns': namespace}
            same_name = element.getparent().xpath('./ns:' + tag,
                                                  namespaces=nsmap)
            # only add an index when there are siblings with the same name
            if len(same_name) > 1:
                tag += '[{}]'.format(same_name.index(element) + 1)
        human_xpath += '/' + tag
    return human_xpath