Replace XML element with another element in Python

I need to replace a particular element from one XML file with another element from a different XML file. I get the element with XPath expressions and I don't have a handle to its parent.
What is the easiest way to replace it in place, so that writing the tree to an XML file reflects the change? I.e. I want to do what this pseudocode does:
# Pseudocode
tree1.open('input1.xml')
tree2.open('input2.xml')
element1 = tree1.findall(...)[0]
element2 = tree2.findall(...)[0]
element1.replaceWith(element2)
tree1.writeToXmlFile('merged.xml')

Ok, I tried __setstate__ and __getstate__ and it worked:
element1.__setstate__(element2.__getstate__())
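The __setstate__/__getstate__ trick relies on pickling hooks rather than a documented editing API, so it may be fragile. A minimal sketch of an alternative with the standard-library ElementTree is shown here, assuming you still have the root of the first tree: build a child-to-parent map, then assign the new element into the parent's slot. The helper name replace_element and the inline sample documents are hypothetical stand-ins for input1.xml and input2.xml. With lxml, element1.getparent().replace(element1, element2) does the same job directly.

```python
import copy
import xml.etree.ElementTree as ET


def replace_element(root, old, new):
    """Replace `old` (a descendant of `root`) with a copy of `new`, in place."""
    # ElementTree elements do not know their parent, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of[old]  # KeyError if `old` is the root or not under `root`
    replacement = copy.deepcopy(new)  # copy so the second tree is left untouched
    replacement.tail = old.tail  # keep the whitespace that followed the old element
    index = list(parent).index(old)
    parent[index] = replacement
    return replacement


# Hypothetical stand-ins for the two input files.
tree1 = ET.ElementTree(ET.fromstring("<a><b><c>old</c></b><d/></a>"))
tree2 = ET.ElementTree(ET.fromstring("<x><c id='2'>new</c></x>"))

element1 = tree1.findall(".//c")[0]
element2 = tree2.findall(".//c")[0]
replace_element(tree1.getroot(), element1, element2)

merged = ET.tostring(tree1.getroot(), encoding="unicode")
print(merged)
```

To write the result to disk, tree1.write('merged.xml') would take the place of the final print.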

Related

Why does adding 3 elements make findall not work?

For parsing information from this url: http://py4e-data.dr-chuck.net/comments_42.xml
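import urllib.request
import xml.etree.ElementTree as ET

url = "http://py4e-data.dr-chuck.net/comments_42.xml"
fhandle = urllib.request.urlopen(url, context=ctx)  # ctx: an ssl.SSLContext created earlier (not shown)
string_data = fhandle.read()
xml = ET.fromstring(string_data)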
Why does
lst = xml.findall("./commentinfo/comments/comment")
not put anything into lst, while
lst = xml.findall("comments/comment")
creates a list of elements.
Thanks!
Element.findall uses a subset of the XPath specification (see XPath support), evaluated relative to the element you call it on. When you loaded the document, you got a reference to the root element <commentinfo>. The XPath comments/comment selects all of that element's child elements named "comments", then selects all of their children named "comment".
./comments/comment is identical to comments/comment. "." is the current node (<commentinfo>) and the following "/comments" selects its child nodes as above.
./commentinfo/comments/comment is the same as commentinfo/comments/comment, and this is where the problem lies: since you are already on the <commentinfo> node, it has no child elements also named "commentinfo", so nothing matches. Some XPath processors would let you reference from the root of the tree, as in //commentinfo/comments/comment, but ElementTree's Element.findall doesn't support that.
'.' in the XPath already means the top-level element, here <commentinfo>. So your path is looking for a <commentinfo> child of that, which doesn't exist.
You can see this by cross-referencing the example from the documentation with the corresponding XML. Notice how none of the example XPaths mention data.
You want just './comments/comment'.
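The difference can be checked without a network call. This is a small sketch using an inline document; the sample XML is made up to mirror the <commentinfo>/<comments>/<comment> nesting described above, not copied from the real file.

```python
import xml.etree.ElementTree as ET

# Made-up sample with the same nesting as the document discussed above.
xml_text = """<commentinfo>
  <comments>
    <comment><name>A</name><count>1</count></comment>
    <comment><name>B</name><count>2</count></comment>
  </comments>
</commentinfo>"""

root = ET.fromstring(xml_text)
print(root.tag)  # fromstring returns the root element itself

# Looks for a <commentinfo> child of the root, which does not exist.
with_root_name = root.findall("./commentinfo/comments/comment")

# Both of these start at the root and walk down to the comments.
relative = root.findall("comments/comment")
dotted = root.findall("./comments/comment")

print(len(with_root_name), len(relative), len(dotted))
```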

Getting element by undefined tag name

I'm parsing an xml document in Python using minidom.
I have an element:
<informationRequirement>
<requiredDecision href="id"/>
</informationRequirement>
The only thing I need is the value of href in the subelement, but its tag name can vary (for example requiredKnowledge instead of requiredDecision; it always begins with required).
If the tag were always the same, I would use something like:
element.getElementsByTagName('requiredDecision')[0].attributes['href'].value
But that's not the case. What can I use instead, given that the tag name varies?
(There will always be exactly one subelement.)
If you're always guaranteed to have one subelement, just grab that element:
element.childNodes[0].attributes['href'].value
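However, this is brittle: if the XML is indented, childNodes[0] is usually a whitespace text node rather than the element. A (perhaps) better approach is to check each child's node type and tag prefix:
hrefs = []
for child in element.childNodes:
    # Skip text and comment nodes, which have no tagName
    if child.nodeType == child.ELEMENT_NODE and child.tagName.startswith('required'):
        hrefs.append(child.attributes['href'].value)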
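A runnable sketch of the prefix-matching idea, assuming an indented document like the one in the question. The helper name required_hrefs is hypothetical. It also shows why indexing childNodes[0] is risky: with indented XML, the first child is a text node.

```python
from xml.dom import minidom

# Indented sample based on the question; requiredKnowledge is the alternative tag name.
doc = minidom.parseString(
    "<informationRequirement>\n"
    '  <requiredKnowledge href="id"/>\n'
    "</informationRequirement>"
)
element = doc.documentElement


def required_hrefs(element):
    """Collect href values from child elements whose tag starts with 'required'."""
    hrefs = []
    for child in element.childNodes:
        # Only element nodes have a tagName; whitespace shows up as text nodes.
        if child.nodeType == child.ELEMENT_NODE and child.tagName.startswith("required"):
            hrefs.append(child.getAttribute("href"))
    return hrefs


first_child_type = element.childNodes[0].nodeType  # a text node in indented XML
hrefs = required_hrefs(element)
print(first_child_type == element.TEXT_NODE, hrefs)
```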

Getting XML attribute value with lxml module

How can I get the value of an attribute in an XML file with the lxml module?
My XML looks like this:
<process>
  <name>somename</name>
  <statistics>
    <stats param='someparam'>
      <value>0.456</value>
      <real_value>0.4</real_value>
    </stats>
    <stats ...>
    .
    .
    .
    </stats>
  </statistics>
</process>
I want to get the value 0.456 from the value element. I'm iterating through the elements and getting the text, but I'm not sure this is the best way to do it:
for attribute in root.iter('statistics'):
    for stats in attribute:
        for param_value in stats.iter('value'):
            value = param_value.text
Is there an easier way to do this? Something like stats.get_value('value')?
Use XPath:
root.find('.//value').text
This gets you the content of the first value tag.
If you want to iterate over all value elements, use findall; this gets you a list with all the elements.
If you only want the value elements inside <stats param='someparam'> elements, make the path more specific:
root.findall("./statistics/stats[@param='someparam']/value")
edit: Note that find/findall only support a subset of XPath. If you want to make use of the whole XPath (1.x) functionality, use the xpath method.
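A short sketch of the three lookups above. It uses the standard-library xml.etree.ElementTree so it runs without extra installs; lxml.etree exposes the same find/findall calls. The second <stats> block (param 'otherparam', value 0.789) is invented for illustration, since the original elides it.

```python
import xml.etree.ElementTree as ET

# The second <stats> element is a made-up stand-in for the elided one.
xml_text = """<process>
  <name>somename</name>
  <statistics>
    <stats param="someparam">
      <value>0.456</value>
      <real_value>0.4</real_value>
    </stats>
    <stats param="otherparam">
      <value>0.789</value>
      <real_value>0.7</real_value>
    </stats>
  </statistics>
</process>"""

root = ET.fromstring(xml_text)

# First <value> anywhere below the root.
first_value = root.find(".//value").text

# Every <value> element, in document order.
all_values = [v.text for v in root.findall(".//value")]

# Only <value> elements under the <stats> with param='someparam'.
some_values = [
    v.text for v in root.findall("./statistics/stats[@param='someparam']/value")
]

print(first_value, all_values, some_values)
```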

How can I get the only element of a certain type out of a list more cleanly than what I am doing?

I am working with some xml files. The schema for the files specifies that there can only be one of a certain type of element (in this case I am working with the footnotes element).
There can be several footnote elements in the footnotes element. I am trying to grab the footnotes element so that I can iterate through it and process the footnote elements it contains.
Here is my current approach:
from collections import OrderedDict as od  # assumed: od is an alias for OrderedDict

def get_footnotes(element_list):
    footnoteDict=od()
    footnotes_element=[item for item in element_list if item.tag=='footnotes'][0]
    for eachFootnote in footnotes_element.iter():
        if eachFootnote.tag=='footnote':
            footnoteDict[eachFootnote.values()[0]]=eachFootnote.text
    return footnoteDict
element_list is a list of the elements that are relevant to me, collected after iterating through the entire tree.
So I am wondering if there is a more Pythonic way to get the footnotes element than iterating through the list of elements. This line in particular seems clumsy to me:
footnotes_element=[item for item in element_list if item.tag=='footnotes'][0]
Something like this should do the job:
from lxml import etree
xmltree = etree.fromstring(your_xml)
for footnote in xmltree.iterfind(".//footnotes/footnote"):  # leading "." keeps the path relative
    # do something
    pass
It's easier to help if you provide some sample XML.
Edit:
If you are working with really big files, you might want to look into iterparse.
This question seems to have quite a nice example: python's lxml and iterparse method
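If you keep the element_list approach, one possible tidy-up is to use next() with a generator expression, which stops at the first match and lets you supply a default instead of raising IndexError. This is a sketch under assumptions: the sample document and the attribute name "id" are invented (the original code takes the first attribute value, whatever it is), and it uses the standard-library ElementTree, whose tag/iter/get calls behave the same way in lxml.

```python
from collections import OrderedDict
import xml.etree.ElementTree as ET

# Made-up sample; the "id" attribute name is an assumption.
xml_text = """<doc>
  <body>text</body>
  <footnotes>
    <footnote id="F1">First note</footnote>
    <footnote id="F2">Second note</footnote>
  </footnotes>
</doc>"""

root = ET.fromstring(xml_text)
element_list = list(root)  # stand-in for the pre-filtered list of elements

# next() returns the first match, or None if there is no footnotes element.
footnotes_element = next(
    (item for item in element_list if item.tag == "footnotes"), None
)

footnote_dict = OrderedDict()
if footnotes_element is not None:
    # iter("footnote") filters by tag, so no explicit tag check is needed.
    for footnote in footnotes_element.iter("footnote"):
        footnote_dict[footnote.get("id")] = footnote.text

print(footnote_dict)
```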

How to access comments using lxml

I am trying to remove comments from a list of elements that were obtained by using lxml
The best I have been able to do is:
no_comments=[element for element in element_list if 'HtmlComment' not in str(type(element))]
I am wondering if there is a more direct way?
Adding to Matthew's answer, which got me almost there: the problem is that once the elements are taken out of the tree, the comments seem to lose some identity (I don't know how to describe it), so isinstance() can no longer tell whether they are HtmlComment objects.
However, that check does work while the elements are being iterated through on the tree:
from lxml.html import HtmlComment
no_comments=[element for element in root.iter() if not isinstance(element,HtmlComment)]
For novices like me: root is the base html element that holds all of the other elements in the tree, and there are a number of ways to get it. One is to open the file and parse it, so instead of root.iter() in the above you can write:
html.fromstring(open(r'c:\temp\testlxml.htm').read()).iter()
You can cut out the strings:
from lxml.html import HtmlComment # or similar
no_comments=[element for element in element_list if not isinstance(element, HtmlComment)]
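For comparison, here is a sketch of the same filtering idea using only the standard library, so it runs without lxml. This swaps in a different technique from the answers above: xml.etree.ElementTree marks comment nodes by setting their tag to the ET.Comment factory, and it only keeps comments when the parser is built with TreeBuilder(insert_comments=True), which assumes Python 3.8 or later. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup containing two comments.
markup = "<div><!-- a note --><p>text</p><!-- another --><span>more</span></div>"

# By default the parser drops comments; insert_comments=True keeps them in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(markup, parser=parser)

all_nodes = list(root.iter())

# Comment nodes carry the ET.Comment factory function as their tag.
no_comments = [el for el in all_nodes if el.tag is not ET.Comment]
tags = [el.tag for el in no_comments]

print(len(all_nodes), tags)
```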
