In python-docx, the paragraph object has a method insert_paragraph_before that allows inserting text before itself:
p.insert_paragraph_before("This is a text")
There is no insert_paragraph_after method, but I suppose a paragraph object knows enough about itself to determine which paragraph comes next. Unfortunately, the inner workings of python-docx's underlying XML tree are a little intricate (and not really documented).
I wonder how to program a function with the following spec?
def insert_paragraph_after(para, text):
Trying to make sense of the inner workings of docx made me dizzy, but fortunately it's easy enough to accomplish what you want, since the internal lxml element already has the necessary method addnext, which is all we need:
from docx import Document
from docx.oxml import OxmlElement
from docx.text.paragraph import Paragraph

def insert_paragraph_after(paragraph, text=None, style=None):
    """Insert a new paragraph after the given paragraph."""
    new_p = OxmlElement("w:p")
    paragraph._p.addnext(new_p)
    new_para = Paragraph(new_p, paragraph._parent)
    if text:
        new_para.add_run(text)
    if style is not None:
        new_para.style = style
    return new_para
def main():
    # Create a minimal document
    document = Document()
    p1 = document.add_paragraph("First Paragraph.")
    p2 = document.add_paragraph("Second Paragraph.")
    # Insert a paragraph wedged between p1 and p2
    insert_paragraph_after(p1, "Paragraph One And A Half.")
    # Check whether the function succeeded
    document.save(r"D:\somepath\docx_para_after.docx")

if __name__ == "__main__":
    main()
For reference, the core of the approach is just addnext on the underlying XML elements. Note that this moves the existing paragraph element rather than creating a new one:

para1 = document.add_paragraph("Hello World")
para2 = document.add_paragraph("Testing!!")

p1 = para1._p
p1.addnext(para2._p)
In the meantime I found another, more high-level method (though perhaps not as elegant). It essentially finds the parent, lists the children, works out its own position in line, and then gets the next one.
def par_index(paragraph):
    """Get the index of the paragraph in the document."""
    doc = paragraph._parent
    # the paragraph proxies are generated on the fly and change all the
    # time, so to index reliably we must compare the XML elements
    l_elements = [p._element for p in doc.paragraphs]
    return l_elements.index(paragraph._element)
def insert_paragraph_after(paragraph, text, style=None):
    """
    Add a paragraph to a docx document, after this one.
    """
    doc = paragraph._parent
    i = par_index(paragraph) + 1  # next
    if i < len(doc.paragraphs):
        # there is a next paragraph, so we insert before it:
        next_par = doc.paragraphs[i]
        new_par = next_par.insert_paragraph_before(text, style)
    else:
        # we reached the end, so we need to append a new one:
        new_par = doc.add_paragraph(text, style)
    return new_par
One advantage is that it mostly avoids getting into the inner workings.
Here is the input sample text. I want to do an object-based cleanup to avoid hierarchy issues:
<p><b><b><i><b><i><b>
<i>sample text</i>
</b></i></b></i></b></b></p>
Required Output
<p><b><i>sample text</i></b></p>
I wrote this object-based cleanup using lxml for nested duplicate tags. It may help others.
import lxml.etree as ET

textcont = '<p><b><b><i><b><i><b><i>sample text</i></b></i></b></i></b></b></p>'
soup = ET.fromstring(textcont)

for tname in ['i', 'b']:
    for tagn in soup.iter(tname):
        if tagn.getparent().getparent() is not None and tagn.getparent().getparent().tag == tname:
            iparOfParent = tagn.getparent().getparent()
            iParent = tagn.getparent()
            if iparOfParent.text is None:
                iparOfParent.addnext(iParent)
                iparOfParent.getparent().remove(iparOfParent)
        elif tagn.getparent() is not None and tagn.getparent().tag == tname:
            iParent = tagn.getparent()
            if iParent.text is None:
                iParent.addnext(tagn)
                iParent.getparent().remove(iParent)

print(ET.tostring(soup))
output:
b'<p><b><i>sample text</i></b></p>'
The markup itself provides enough structure to extract the elements inside. Using re in Python, you can extract the tags and recombine them.
For example:
import re

html = """<p><b><b><i><b><i><b>
<i>sample text</i>
</b></i></b></i></b></b></p>"""

regex_object = re.compile(r"<(.*?)>")
html_objects = regex_object.findall(html)

set_html = []
for obj in html_objects:
    if obj[0] != "/" and obj not in set_html:
        set_html.append(obj)

regex_text = re.compile(r">(.*?)<")
text = [result for result in regex_text.findall(html) if result][0]

# Recombine
result = ""
for obj in set_html:
    result += f"<{obj}>"
result += text
for obj in set_html[::-1]:
    result += f"</{obj}>"

# result = '<p><b><i>sample text</i></b></p>'
You can use the re module to create a function that searches for a matching opening-and-closing tag pair and everything in between. Storing tags in a dictionary removes duplicate tags while maintaining the order they were found in (if order isn't important, just use a set). Once all pairs of tags are found, wrap what's left with the keys of the dictionary in reverse order.
import re

def remove_duplicates(string):
    tags = {}
    while (match := re.findall(r'<(.+)>([\w\W]*)</\1>', string)):
        tag, string = match[0][0], match[0][1]  # match is [(group0, group1)]
        tags.update({tag: None})
    for tag in reversed(tags):
        string = f'<{tag}>{string}</{tag}>'
    return string
Note: I've used [\w\W]* as a cheat to match everything.
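For example, on the sample input from the question:

print(remove_duplicates('<p><b><b><i><b><i><b><i>sample text</i></b></i></b></i></b></b></p>'))
# <p><b><i>sample text</i></b></p>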
Is it possible in Python to pretty print the root's attributes?
I used etree to extend the attributes of the child tag, and then I overwrote the existing file with the new content. However, during the first generation of the XML we used a template in which the attributes of the root tag were listed one per line, and with etree I can't manage to achieve the same result.
I found similar questions, but they all referred to the etree tutorial, which I find incomplete.
Hopefully someone has found a solution for this using etree.
EDIT: This is for custom XML so HTML Tidy (which was proposed in the comments), doesn't work for this.
Thanks!
from lxml import etree

# list_generated_files, generated_descriptors_folder and descriptor_attributes
# are defined elsewhere in the surrounding script
parser = etree.XMLParser(remove_blank_text=True)  # assumed; pretty_print needs it to re-indent

generated_descriptors = list_generated_files(generated_descriptors_folder)
counter = 0
for g in generated_descriptors:
    if counter % 20 == 0:
        print("Extending Descriptor # %s out of %s" % (counter, len(descriptor_attributes)))
    with open(generated_descriptors_folder + "\\" + g, 'r+b') as descriptor:
        root = etree.XML(descriptor.read(), parser=parser)
        # Go through every ContextObject to check if the block is mandatory
        for context_object in root.findall('ContextObject'):
            for attribs in descriptor_attributes:
                if attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'false')
                elif attribs['descriptor_name'] == g[:-11] and context_object.attrib['name'] not in attribs['attributes']['mandatoryobjects']:
                    context_object.set('allow-null', 'true')
        # Sort the ContextObjects based on allow-null and their name
        context_objects = root.findall('ContextObject')
        context_objects_sorted = sorted(context_objects,
                                        key=lambda c: (c.attrib['allow-null'], c.attrib['name']))
        root[:] = context_objects_sorted
        # Remove mandatoryobjects from the Descriptor attributes and pretty print
        root.attrib.pop("mandatoryobjects", None)
        # paste new line here
        # Convert to string in order to write the enhanced descriptor
        xml = etree.tostring(root, pretty_print=True, encoding="UTF-8", xml_declaration=True)
        # Write the enhanced descriptor
        descriptor.seek(0)      # Set cursor at beginning of the file
        descriptor.truncate(0)  # Make sure the file is empty
        descriptor.write(xml)
    counter += 1
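lxml's pretty_print only controls element indentation; it always writes all attributes of a tag on a single line. One workaround is to post-process the serialized text. Here is a minimal sketch (split_root_attributes is a hypothetical helper; it assumes double-quoted attribute values and a root tag without a namespace prefix):

import re

def split_root_attributes(xml_text, root_tag):
    # Hypothetical helper: put each attribute of the root element's opening
    # tag on its own line, aligned under the first attribute.
    match = re.search(r'<%s\s+([^>]*?)(/?)>' % re.escape(root_tag), xml_text)
    if match is None:
        return xml_text
    attrs = re.findall(r'[\w:.-]+="[^"]*"', match.group(1))
    indent = '\n' + ' ' * (len(root_tag) + 2)
    replacement = '<%s %s%s>' % (root_tag, indent.join(attrs), match.group(2))
    return xml_text[:match.start()] + replacement + xml_text[match.end():]

# usage with the code above: decode, rewrite the root tag, re-encode
xml = etree.tostring(root, pretty_print=True, encoding="UTF-8", xml_declaration=True)
xml = split_root_attributes(xml.decode("utf-8"), root.tag).encode("utf-8")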
How can a docx table be indented? I am trying to line a table up with a tab stop set at 2cm. The following script creates a header, some text and a table:
import docx
from docx.shared import Cm

doc = docx.Document()
style = doc.styles['Normal']
style.paragraph_format.tab_stops.add_tab_stop(Cm(2))
doc.add_paragraph('My header', style='Heading 1')
doc.add_paragraph('\tText is tabbed')

# This indents the paragraph inside, not the table
# style = doc.styles['Table Grid']
# style.paragraph_format.left_indent = Cm(2)

table = doc.add_table(rows=0, cols=2, style="Table Grid")
for rowy in range(1, 5):
    row_cells = table.add_row().cells
    row_cells[0].text = 'Row {}'.format(rowy)
    row_cells[0].width = Cm(5)
    row_cells[1].text = ''
    row_cells[1].width = Cm(1.2)

doc.save('output.docx')
It produces a table with no indent, flush with the left margin. How can the table be indented so that it lines up with the 2 cm tab stop (preferably without having to load an existing document)? If, for example, a left indent is added to the Table Grid style (by uncommenting the lines above), it is applied at the paragraph level inside the cells, not at the table level, which is not what is wanted.
In Microsoft Word, this can be done in the table properties by entering 2.0 cm for "Indent from left".
Based on Fred C's answer, I came up with this solution:
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def indent_table(table, indent):
    # noinspection PyProtectedMember
    tbl_pr = table._element.xpath('w:tblPr')
    if tbl_pr:
        e = OxmlElement('w:tblInd')
        e.set(qn('w:w'), str(indent))
        e.set(qn('w:type'), 'dxa')
        tbl_pr[0].append(e)
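The indent value is in dxa (twentieths of a point), so for the 2 cm indent from the question you would pass roughly 1134 (1 cm ≈ 567 dxa):

indent_table(table, 1134)  # 2 cm ≈ 1134 dxa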
This feature is not yet supported by python-docx. It looks like this behavior is produced by the w:tblInd child of the w:tblPr element (inside w:tbl). It's possible you could develop a workaround function to add an element like this using lxml calls on the w:tbl element, which should be available on the ._element attribute of a Table object.
You can find examples of other workaround functions by searching on 'python-docx workaround function' and similar ones by searching on 'python-pptx workaround functions'.
Here's how I did it:
import docx
import lxml.etree

mydoc = docx.Document()
mytab = mydoc.add_table(3, 3)

nsmap = mytab._element[0].nsmap       # For namespaces
searchtag = '{%s}tblPr' % nsmap['w']  # w:tblPr
mytag = '{%s}tblInd' % nsmap['w']     # w:tblInd
myw = '{%s}w' % nsmap['w']            # w:w
mytype = '{%s}type' % nsmap['w']      # w:type

for elt in mytab._element:
    if elt.tag == searchtag:
        myelt = lxml.etree.Element(mytag)
        myelt.set(myw, '1000')
        myelt.set(mytype, 'dxa')
        elt.append(myelt)
I'd like to write a code snippet that grabs all of the text inside the <content> tag, in lxml, in all three instances below, including any embedded tags. I've tried tostring(getchildren()), but that misses the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?
<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"
<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"
<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"
Just use the node.itertext() method, as in:
''.join(node.itertext())
Does text_content() do what you need?
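Note that both itertext() and text_content() concatenate only the text nodes; the inner tags themselves are dropped, so they fit case 2 above but not the cases where the markup should be kept. A quick check (text_content() exists on lxml.html elements):

import lxml.html

node = lxml.html.fragment_fromstring("<content>Text outside tag <div>Text inside tag</div></content>")
print(node.text_content())
# Text outside tag Text inside tag  -- the <div> markup is gone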
Try:
def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
             list(chain(*([c.text, tostring(c, encoding='unicode'), c.tail] for c in node.getchildren()))) +
             [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))
Example:
from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)
This is intended to produce '\nText outside tag <div>Text <em>inside</em> tag</div>\n', but note that this version duplicates child text and tails, since tostring(c) already includes them; see the fixed version below.
A version of albertov's stringify_children that solves the bugs reported by hoju:
def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False, encoding='unicode'), child.tail)
                    for child in node.getchildren())),
            (node.tail,))
        if chunk)
The following snippet, which uses Python generators, works perfectly and is very efficient:
''.join(node.itertext()).strip()
Defining stringify_children this way may be less complicated:
from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s
or in one line
return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))
Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node in this case isn't interesting since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.
Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:
def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]
which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.
One of the simplest code snippets that actually worked for me, per the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

etree.tostring(html, method="text")

where html is the node/tag whose complete text you are trying to read. Note that it doesn't get rid of script and style tags, though.
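For the <content> example from the question, a quick check (note that, like itertext(), this drops the inner tags):

from lxml import etree

node = etree.fromstring("<content>Text outside tag <div>Text inside tag</div></content>")
print(etree.tostring(node, method="text", encoding="unicode"))
# Text outside tag Text inside tag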
from urllib.request import urlopen
from lxml import etree

url = 'some_url'

# fetch the url
page = urlopen(url).read()

# parse all of the html, including the table tag
tree = etree.HTML(page)

# xpath selector; xpath() returns a list, so take the first hit
table = tree.xpath("xpath_here")[0]
res = etree.tostring(table)  # res is the html code of table

This did the job for me.
So you can extract a tag's text with an XPath text() selector, and extract tags including their content using tostring():

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)
text = tree.xpath("//content/text()")
div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3, encoding='unicode').strip('<content>').rstrip('</')

The strip calls in that last line are not nice, but it just works.
Just a quick enhancement, since the answer has already been given. If you want to clean the inside text:
clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
In response to @Richard's comment above, if you patch stringify_children to read:

parts = ([node.text] +
--       list(chain(*([c.text, tostring(c, encoding='unicode'), c.tail] for c in node.getchildren()))) +
++       list(chain(*([tostring(c, encoding='unicode')] for c in node.getchildren()))) +
         [node.tail])

it seems to avoid the duplication he refers to.
I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:
import lxml.html

def stringify_children(node):
    """Given an lxml tag, return its contents as a string

    >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
    >>> node = lxml.html.fragment_fromstring(html)
    >>> stringify_children(node)
    '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    node.attrib.clear()
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    return lxml.html.tostring(node, encoding='unicode')[opening_tag:closing_tag]
Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node and attacks the problem from a different angle than the other working solutions.
Here is a working solution. We can get the content with its parent tag and then cut the parent tag from the output.
import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
    content_with_parent = etree.tostring(parent_element, encoding='unicode')

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            return chr(int(m.group(1)))

        return re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE | re.UNICODE)

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip()  # remove 'white' characters on margins

    start_tag, content_without_parent, end_tag = re.findall(
        RE_CUT, content_with_parent, flags=re.UNICODE | re.MULTILINE | re.DOTALL)[0]
    if start_tag != end_tag:
        raise Exception('Start tag does not match end tag while getting content with tags.')

    return content_without_parent
parent_element must be an Element. Note that if you want plain text content (with HTML entities decoded to characters), leave the html_entities parameter as False.
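For example, on the third sample from the question:

from lxml import etree

node = etree.fromstring('<content>Text outside tag <div>Text inside tag</div></content>')
print(_tostr_with_tags(node))
# Text outside tag <div>Text inside tag</div>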
lxml has a method for that (on lxml.html elements):
node.text_content()
If this is an a tag, you can try:
node.values()
import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
<div>Text
<em>inside</em>
tag
</div>
Text after inner tag
</content>""")

print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z",
                etree.tostring(node, encoding="unicode"),
                re.DOTALL).group(1))
Consider a reStructuredText document with this skeleton:
Main Title
==========
text text text text text
Subsection
----------
text text text text text
.. my-import-from:: file1
.. my-import-from:: file2
The my-import-from directive is provided by a document-specific Sphinx extension, which is supposed to read the file provided as its argument, parse reST embedded in it, and inject the result as a section in the current input file. (Like autodoc, but for a different file format.) The code I have for that, right now, looks like this:
import os

from docutils import nodes
from docutils.parsers.rst import Directive
from docutils.statemachine import ViewList
from sphinx.ext.autodoc import AutodocReporter
from sphinx.util.nodes import nested_parse_with_titles

# extract_doc_from_file and make_title_text are this extension's own helpers
class MyImportFromDirective(Directive):
    required_arguments = 1

    def run(self):
        src, srcline = self.state_machine.get_source_and_line()
        doc_file = os.path.normpath(os.path.join(os.path.dirname(src),
                                                 self.arguments[0]))
        self.state.document.settings.record_dependencies.add(doc_file)

        doc_text = ViewList()
        try:
            doc_text = extract_doc_from_file(doc_file)
        except EnvironmentError as e:
            raise self.error(e.filename + ": " + e.strerror) from e

        doc_section = nodes.section()
        doc_section.document = self.state.document

        # report line numbers within the nested parse correctly
        old_reporter = self.state.memo.reporter
        self.state.memo.reporter = AutodocReporter(doc_text,
                                                   self.state.memo.reporter)
        nested_parse_with_titles(self.state, doc_text, doc_section)
        self.state.memo.reporter = old_reporter

        if len(doc_section) == 1 and isinstance(doc_section[0], nodes.section):
            doc_section = doc_section[0]

        # If there was no title, synthesize one from the name of the file.
        if len(doc_section) == 0 or not isinstance(doc_section[0], nodes.title):
            doc_title = nodes.title()
            doc_title.append(make_title_text(doc_file))
            doc_section.insert(0, doc_title)

        return [doc_section]
This works, except that the new section is injected as a child of the current section, rather than a sibling. In other words, the example document above produces a TOC tree like this:
Main Title
    Subsection
        File1
        File2
instead of the desired
Main Title
    Subsection
    File1
    File2
How do I fix this? The Docutils documentation is ... inadequate, particularly regarding control of section depth. One obvious thing I have tried is returning doc_section.children instead of [doc_section]; that completely removes File1 and File2 from the TOC tree (but does make the section headers in the body of the document appear at the right nesting level).
I don't think it is possible to do this by returning the section from the directive (without doing something along the lines of what Florian suggested), as it will get appended to the 'current' section. You can, however, add the section via self.state.section, as I do in the following (handling of options removed for brevity):
from docutils import nodes
from docutils.parsers.rst import Directive, directives
from sphinx.util.nodes import nested_parse_with_titles

class FauxHeading(object):
    """
    A heading level that is not defined by a string. We need this to work with
    the mechanics of
    :py:meth:`docutils.parsers.rst.states.RSTState.check_subsection`.

    The important thing is that the length can vary, but it must be equal to
    any other instance of FauxHeading.
    """

    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __eq__(self, other):
        return isinstance(other, FauxHeading)

class ParmDirective(Directive):

    required_arguments = 1
    optional_arguments = 0
    has_content = True
    option_spec = {
        'type':      directives.unchanged,
        'precision': directives.nonnegative_int,
        'scale':     directives.nonnegative_int,
        'length':    directives.nonnegative_int}

    def run(self):
        variableName = self.arguments[0]
        lineno = self.state_machine.abs_line_number()
        secBody = None
        block_length = 0

        # added for some space
        lineBlock = nodes.line('', '', nodes.line_block())

        # parse the body of the directive
        if self.has_content and len(self.content):
            secBody = nodes.container()
            block_length += nested_parse_with_titles(
                self.state, self.content, secBody)

        # keeping track of the level seems to be required if we want to allow
        # nested content. Not sure why, but it fits the pattern in
        # :py:meth:`docutils.parsers.rst.states.RSTState.new_subsection`
        myLevel = self.state.memo.section_level
        self.state.section(
            variableName,
            '',
            FauxHeading(2 + len(self.options) + block_length),
            lineno,
            [lineBlock] if secBody is None else [lineBlock, secBody])
        self.state.memo.section_level = myLevel
        return []
I don't know how to do it directly inside your custom directive. However, you can use a custom transform to raise the File1 and File2 nodes in the tree after parsing. For example, see the transforms in the docutils.transforms.frontmatter module.
In your Sphinx extension, use the Sphinx.add_transform method to register the custom transform.
Update: You can also directly register the transform in your directive by returning one or more instances of the docutils.nodes.pending class in your node list. Make sure to call the note_pending method of the document in that case (in your directive you can get the document via self.state_machine.document).
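For example, here is a minimal sketch of the transform approach. It assumes, hypothetically, that the directive marks each injected section with a 'my-imported' class so the transform can find it and lift it one level up:

from docutils import nodes
from docutils.transforms import Transform

class PromoteImportedSections(Transform):
    # run late, after the document tree has been fully parsed
    default_priority = 780

    def apply(self):
        # traverse() returns a list, so we can mutate the tree while looping
        for section in self.document.traverse(nodes.section):
            if 'my-imported' not in section['classes']:
                continue
            parent = section.parent
            if isinstance(parent, nodes.section):
                # detach the section and re-insert it as a sibling of its parent
                parent.remove(section)
                grandparent = parent.parent
                grandparent.insert(grandparent.index(parent) + 1, section)

def setup(app):
    app.add_transform(PromoteImportedSections)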