Preserve comments outside the root element of xml file - python

I am updating an XML file using Python's xml module and I want to preserve all the comments. Using the insert_comments argument preserves the comments inside the root element, but removes those outside it, at the end of the XML file.
Is there any way to save those comments along with the inner ones?
Below is the answer I found.
Python 3.8 added the insert_comments argument to TreeBuilder:
class xml.etree.ElementTree.TreeBuilder(element_factory=None, *, comment_factory=None, pi_factory=None, insert_comments=False, insert_pis=False)
When insert_comments and/or insert_pis is true, comments/pis will be inserted into the tree if they appear within the root element (but not outside of it).
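A minimal sketch of the behaviour described in that documentation excerpt, assuming Python 3.8 or later. The sample document and the variable names are made up for illustration; only TreeBuilder, XMLParser, fromstring and tostring come from the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one comment before, one inside, one after the root element.
xml_text = "<!-- before --><root><!-- inside --><a/></root><!-- after -->"

# insert_comments=True keeps comments that appear inside the root element.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(xml_text, parser=parser)

serialized = ET.tostring(root, encoding="unicode")
print(serialized)  # the "before" and "after" comments are not part of the tree
```

Only the inner comment survives the round trip, which matches the limitation the question describes.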

Since your question is tagged "elementtree", this answer is specific to that module.
The solution you are expecting may not be possible because of the default design of ElementTree's XMLParser: the parser skips over comments. See the snippet below from the official documentation.
Note that XMLParser skips over comments in the input instead of
creating comment objects for them. An ElementTree will only contain
comment nodes if they have been inserted into to the tree using one of
the Element methods.
This passage appears in the documentation linked below.
https://docs.python.org/3/library/xml.etree.elementtree.html
Search for xml.etree.ElementTree.Comment on that page.

Related

ElementTree not preserving attribute order

I tried to build an XML element in this format:
from lxml import etree
box = etree.SubElement(image, 'box',top=t,left=l,width=w,height=h)
but what I got is
<box height="511" left="1" top="1" width="510">
I noticed that the attributes are in alphabetical order.
What should I do to keep them in the order I specified?
In XML, the order of attributes is not significant according to the specification, so a serializer is not required to preserve it.
Your only option would be to test other XML parsers; you may find one that doesn't change the order (which depends on its internal implementation).
This SO answer elaborates on that.
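As a side note that is not part of the original answer: since Python 3.8 the standard library's xml.etree.ElementTree keeps attributes in insertion order when serializing. A small sketch, assuming Python 3.8 or later and using the standard library rather than lxml; the attribute values are copied from the question's output.

```python
import xml.etree.ElementTree as ET

image = ET.Element("image")
box = ET.SubElement(image, "box")

# Set the attributes one by one in the order we want them written.
for name, value in [("top", "1"), ("left", "1"), ("width", "510"), ("height", "511")]:
    box.set(name, value)

serialized = ET.tostring(box, encoding="unicode")
print(serialized)
```

Code that reads the XML should still not rely on attribute order, for the reason given above.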

Python: Using namespaces in ElementTree with iter()

I am using xml.etree.ElementTree to parse some complex xml files. Some of the xml files have several repeating tags nested in them.
<product:object>
<product:parent>
<product:parent>
<product:parent>
</product:parent>
</product:parent>
</product:parent>
</product:object>
I am using .iter() to find the repeating tag in different layers. Normally a second argument can be passed to .find() and .findall(). However, for some reason .iter() doesn't have this option.
Am I missing something, or is there another way of properly doing this?
I know how to build workarounds, and have done so.
e.g.
- A definition that is reiterated and passes the parent element.
- Manually mapping the namespaces
I am hoping there is a better way!?
I found that the XPath syntax .findall('.//product:parent', ns) can be used as a substitute for .iter().
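A short sketch of that substitution. The namespace URI is a hypothetical placeholder, since the question does not show the real one. It also shows that .iter() accepts a fully qualified tag in {uri}name form, which avoids the prefix mapping altogether.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = "http://example.com/product"  # hypothetical URI, not from the question

xml_text = (
    f'<product:object xmlns:product="{PRODUCT_NS}">'
    "<product:parent><product:parent><product:parent/>"
    "</product:parent></product:parent>"
    "</product:object>"
)
root = ET.fromstring(xml_text)

# XPath-style search with a prefix-to-URI mapping.
ns = {"product": PRODUCT_NS}
via_findall = root.findall(".//product:parent", ns)

# iter() has no namespaces argument, but it understands the {uri}tag form.
via_iter = list(root.iter(f"{{{PRODUCT_NS}}}parent"))

print(len(via_findall), len(via_iter))
```

One difference to keep in mind: .iter() also yields the element it is called on if that element matches, while './/' only searches below it.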

Finding and converting XML processing instructions using Python

We are converting our ancient FrameMaker docs to XML. My job is to convert this:
<?FM MARKER [Index] foo, bar ?>
to this:
<indexterm>
<primary>foo, bar</primary>
</indexterm>
I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:
ElementTree has the excellent Element.iter() method, which is a depth-first search, but it doesn't process ProcessingInstructions.
ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.
xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.
Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?
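Use the XPath API of the lxml module like this:
from io import StringIO
from lxml import etree
# Sample document containing one processing instruction to find.
foo = StringIO('<foo><bar><?FM MARKER [Index] foo, bar ?></bar></foo>')
tree = etree.parse(foo)
# Returns a list of all processing instructions anywhere in the document.
result = tree.xpath('//processing-instruction()')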
The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.
References
XPath and XSLT with lxml
XML Path Language 1.0: Node Tests
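If lxml is not available, the same search can be sketched with minidom, which the question started from. This is only an illustration under stated assumptions: find_processing_instructions is a hypothetical helper name and the sample document is made up. It walks childNodes recursively and reports each processing instruction together with its parent.

```python
from xml.dom import minidom, Node


def find_processing_instructions(node):
    """Yield every processing instruction below node, depth-first."""
    for child in node.childNodes:
        if child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            yield child
        else:
            yield from find_processing_instructions(child)


# Hypothetical sample with markers under two different elements.
doc = minidom.parseString(
    "<doc><p>text<?FM MARKER [Index] foo, bar ?></p>"
    "<sec><?FM MARKER [Index] baz ?></sec></doc>"
)

found = [
    (pi.target, pi.data.strip(), pi.parentNode.tagName)
    for pi in find_processing_instructions(doc)
]
print(found)
```

Each result carries parentNode, so a marker can later be swapped for an indexterm element with the parent's replaceChild() method.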

Choosing next relative in Python BeautifulSoup with automation

First of all, I'm creating an XML document with Python BeautifulSoup.
What I'm currently trying to create is very similar to this example:
<options>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
</options>
Notice that each opt tag should have only one attribute, called name.
As there can be many more options, and different ones as well, I decided to write a small Python function to help me produce this result.
array = ['FirstName', 'SecondName', 'ThirdName']
# This list tells the function how many options to create and what each option's name attribute should be.
# Assumes soup is an existing BeautifulSoup document object created earlier.
def create_options(array):
    soup.append(soup.new_tag('options'))
    if len(array) > 0:  # Small error handling, so you can see whether the given list is empty. Optional.
        for i in range(len(array)):
            soup.options.append(soup.new_tag('opt'))
            # With BeautifulSoup methods, we create opt tags inside the options tag, one per item in the list.
        counter = 0
        # range() could be used instead, but for testing purposes the current approach is sufficient.
        for tag in soup.options.find_all():
            soup.options.find('opt')['name'] = str(array[counter])
            # Notice that here the name is assigned only to the first opt element. We'll discuss this next.
            counter += 1
        print(len(array), 'options were created.')
    else:
        print('No options were created.')
As you can see, the assignment in the function is handled by a for loop which, unfortunately, assigns all the different names to the first opt in the options element.
BeautifulSoup has .next_sibling and .previous_sibling, which can help with this task.
As their names suggest, they give access to the next or previous sibling of an element. So, with this example:
soup.options.find('opt').next_sibling['name'] = str(array[counter])
we can access the second child of the options element. So, if we add .next_sibling to each soup.options.find('opt'), we can move from the first element to the next.
The problem is that finding an opt element in options with:
soup.options.find('opt')
returns the first opt every time. But my function needs to access the next opt for each item in the list. That means the more items there are in the list, the more .next_sibling accesses must be chained onto the first opt.
As a result, with the logic I constructed, accessing the relevant opt for the 4th or a later item in the list to assign its name should look like this:
soup.options.find('opt').next_sibling.next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
And now for my questions:
1st. I haven't found any other way to do this with Python BeautifulSoup methods, but I'm not sure my approach is the only way. Is there any other method?
2nd. How could I achieve the result with this approach if, as my experiments show, I can't put a variable inside a chain of attribute accesses (so that I could repeat them)?
#Like this
thirdoption = .next_sibling.next_sibling.next_sibling
#As well, it's not quite possible, but it's just example.
soup.options.find('opt').next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
3rd. Maybe I read the BeautifulSoup documentation badly and just didn't find a method that could help me with this task?
I managed to achieve the result without BeautifulSoup methods.
Python's ElementTree methods were sufficient for the job.
So let me show the example code and explain what it does. The comments give the details.
"""
The BeautifulSoup XML document generation comes before this code. Apart from the part I mentioned in the question, we just create empty opt tags in the document, which leaves the document almost done.
Right after that, this little script uses the ElementTree methods from the Python standard library.
"""
import xml.etree.ElementTree as ET
ET_tree = ET.parse("exported_file.xml")
# Here we load exactly the same file we wrote with soup. You can write to a different file if you wish.
ET_root = ET_tree.getroot()
# The original looped over an undefined name (item); this assumes options is a child of the root element.
for position, opt in enumerate(ET_root.find('options')):
    # position replaces the counter from the for loop in the first example. It picks the matching item from array, which is the template for our opt name attributes.
    opt.set('name', str(array[position]))
    opt.text = 'text'
    # In the same way, position can be used to get data from other related lists, provided they are ordered the same way.
ET_tree.write('exported_file.xml', encoding="UTF-8", xml_declaration=True)
# This is something I researched quite a lot. It saves the XML document with UTF-8 encoding, which is very important.
This approach is pretty inefficient, since I could use ET for everything to achieve the same result.
However, BeautifulSoup produces nicely formatted output, which is neat, whereas ElementTree creates files with a look that is only software-friendly.
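For comparison, here is a sketch of what "using ET for everything" could look like. It is not the code from the answer: create_options is rewritten for illustration, and the text placeholder is taken from the example in the question. Setting the name while each opt is created means no counter or .next_sibling chain is needed.

```python
import xml.etree.ElementTree as ET


def create_options(names, text="ContentString"):
    """Build an <options> element holding one <opt name="..."> per entry in names."""
    options = ET.Element("options")
    for name in names:
        # The name attribute is set at creation time, so each opt gets its own value.
        opt = ET.SubElement(options, "opt", {"name": name})
        opt.text = text
    return options


options = create_options(["FirstName", "SecondName", "ThirdName"])
serialized = ET.tostring(options, encoding="unicode")
print(serialized)
```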

Search for an XML node in a parent by string with python

I'm working with Python's xml.dom. I'm looking for a method that takes a node and a string and returns the XML node with that name. I can't find it in the documentation.
I'm thinking it would work something like this:
nodeObject = parent.FUNCTION('childtoFind')
where nodeObject is under the parent.
Or, barring the existence of such a method, is there a way I can make the string a node object?
You are looking for the .getElementsByTagName() method:
nodeObjects = parent.getElementsByTagName('childtoFind')
It returns a list of all matching descendants; if you need only one node, use indexing:
nodeObject = parent.getElementsByTagName('childtoFind')[0]
You really want to use the ElementTree API instead; it's easier to use. Even the minidom documentation makes this recommendation:
Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.
The ElementTree API has a .find() method that lets you find the first matching child:
element = parent.find('childtoFind')
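A small side-by-side sketch of the two approaches on a made-up document. The element names follow the question, and everything else is illustrative.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

xml_text = (
    "<parent><other/>"
    "<childtoFind>first</childtoFind>"
    "<childtoFind>second</childtoFind>"
    "</parent>"
)

# minidom: getElementsByTagName returns every matching descendant.
dom_parent = minidom.parseString(xml_text).documentElement
dom_nodes = dom_parent.getElementsByTagName("childtoFind")
dom_first = dom_nodes[0]

# ElementTree: find returns the first matching child, or None if there is none.
et_parent = ET.fromstring(xml_text)
et_first = et_parent.find("childtoFind")
et_missing = et_parent.find("noSuchChild")

print(dom_first.firstChild.data, et_first.text, et_missing)
```

Indexing with [0] on the minidom result raises IndexError when nothing matches, whereas .find() returns None, so the two need different checks for a missing node.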
