Get Document from Node or Element objects with minidom

Is there a way I can get the document root from a child Element or Node? I am migrating from a library that works with any of Document, Element or Node to one that works only with Document. For example:
From:
element.xpath('/a/b/c') # 4Suite
to:
xpath.find('/a/b/c', doc) # pydomxpath

Node objects have an ownerDocument property that refers to the Document object associated with the node. See http://www.w3.org/TR/DOM-Level-2-Core/core.html#node-ownerDoc.
This property is not mentioned in the Python documentation, but it's available. Example:
from xml.dom import minidom
XML = """
<root>
<x>abc</x>
<y>123</y>
</root>"""
dom = minidom.parseString(XML)
x = dom.getElementsByTagName('x')[0]
print(x)
print(x.ownerDocument)
Output (the memory addresses will differ):
<DOM Element: x at 0xc57cd8>
<xml.dom.minidom.Document object at 0x00C1CC60>
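Since the goal is to hand a Document to a library no matter what kind of node is at hand, the lookup can be wrapped in a small helper. The following is a minimal sketch, not part of minidom: `get_document` is a hypothetical name, and it assumes the input is always a minidom node.

```python
from xml.dom import Node, minidom


def get_document(node):
    """Return the owning Document for a Document, Element, or other Node."""
    # A Document is its own answer; every other node points to it
    # through ownerDocument.
    if node.nodeType == Node.DOCUMENT_NODE:
        return node
    return node.ownerDocument


dom = minidom.parseString("<root><x>abc</x><y>123</y></root>")
x = dom.getElementsByTagName("x")[0]

print(get_document(x) is dom)    # Element -> its Document
print(get_document(dom) is dom)  # Document -> itself
```

With such a helper, a call like xpath.find('/a/b/c', get_document(element)) could replace the old element.xpath('/a/b/c') call, assuming the absolute path is meant to be evaluated from the document root.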

Related

xml parsing in python with XPath

I am trying to parse an XML file in Python with the built-in xml module and ElementTree, but whatever I try according to the documentation does not give me what I need.
I am trying to extract all the value tags into a list:
<?xml version="1.0" encoding="UTF-8"?>
<CustomField xmlns="http://soap.sforce.com/2006/04/metadata">
<fullName>testPicklist__c</fullName>
<externalId>false</externalId>
<label>testPicklist</label>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<type>Picklist</type>
<valueSet>
<restricted>true</restricted>
<valueSetDefinition>
<sorted>false</sorted>
<value>
<fullName>a 32</fullName>
<default>false</default>
<label>a 32</label>
</value>
<value>
<fullName>23 432;:</fullName>
<default>false</default>
<label>23 432;:</label>
</value>
And here is the example code that I can't get to work. It's very basic; the only issue I have is with the XPath.
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
print(root.findall(".//value"))
print(root.findall(".//*/value"))
print(root.findall("./*/value"))

Since the root element has attribute xmlns="http://soap.sforce.com/2006/04/metadata", every element in the document will belong to this namespace. So you're actually looking for {http://soap.sforce.com/2006/04/metadata}value elements.
To find all <value> elements in this document, pass a prefix-to-URI mapping through the namespaces argument of findall():
from xml.etree.ElementTree import ElementTree
field_filepath= "./testPicklist__c.field-meta.xml"
mydoc = ElementTree()
mydoc.parse(field_filepath)
root = mydoc.getroot()
# get the namespace of root
ns = root.tag.split('}')[0][1:]
# create a dictionary with the namespace
ns_d = {'my_ns': ns}
# get all the values
values = root.findall('.//my_ns:value', namespaces=ns_d)
# print the values
for value in values:
    print(value)
Outputs:
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043ba0>
<Element '{http://soap.sforce.com/2006/04/metadata}value' at 0x7fceea043e20>
Alternatively, you can search for the fully qualified name {http://soap.sforce.com/2006/04/metadata}value directly:
# get all the values
values = root.findall('.//{http://soap.sforce.com/2006/04/metadata}value')
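To end up with a plain list of values rather than Element objects, the same namespace mapping can be reused to reach the text inside each <value>. This is a minimal sketch: the XML is an abbreviated, closed version of the file in the question, and the prefix md is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the file from the question, with closing tags added.
XML = """<CustomField xmlns="http://soap.sforce.com/2006/04/metadata">
  <fullName>testPicklist__c</fullName>
  <valueSet>
    <valueSetDefinition>
      <value><fullName>a 32</fullName><label>a 32</label></value>
      <value><fullName>23 432;:</fullName><label>23 432;:</label></value>
    </valueSetDefinition>
  </valueSet>
</CustomField>"""

root = ET.fromstring(XML)
ns = {"md": "http://soap.sforce.com/2006/04/metadata"}

# Every step of the path needs the prefix, not only the last one.
full_names = [el.text for el in root.findall(".//md:value/md:fullName", ns)]
print(full_names)
```

The top-level fullName (testPicklist__c) is not picked up because the path only matches fullName elements that are children of a value element.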

Is there a way to return the value for a tag from a XML based on a specific path in python?

I have this XML
<Body>
<Batch_Number>2000</Batch_Number>
<Total_No_Of_Batches>12312</Total_No_Of_Batches>
<requestNo>1923</requestNo>
<Parent1>
<Parent2>
<Parent3>
<lastModifiedDateTime>2022-11-11T11:07:30.000</lastModifiedDateTime>
<purpose>NeverMore</purpose>
<endDate>9999-12-31T00:00:00.000</endDate>
<createdDateTime>2019-06-06T06:32:16.000</createdDateTime>
<createdOn>2019-06-06T08:32:16.000</createdOn>
<address2>Forever street 21</address2>
<externalCode>code123</externalCode>
<lastModifiedBy>user2.thisUser</lastModifiedBy>
<lastModifiedOn>2039-06-11T13:07:30.000</lastModifiedOn>
<lastModifiedBy>MG</lastModifiedBy>
<PS>1234431</PS>
</Parent3>
</Parent2>
</Parent1>
</Body>
Is there a way to return the value for lastModifiedBy for example where the path has this specific structure :
Body.Parent1.Parent2.Parent3.lastModifiedBy
Ideally, I would like to populate a dictionary with the child tag name and its value, for example:
dict[lastModifiedBy.tag] = lastModifiedBy.text

You can use the xml package from the standard library to work with XML files.
from xml.etree import ElementTree as ET
tree = ET.parse("d.xml") # our xml file
root = tree.getroot()
You can then access child elements by index, or iterate over root as you would over a list:
for i in root:
    print(i)
An XML element may have more than one child with the same tag name (you even have two lastModifiedBy elements in Parent3). This is why elements behave like lists rather than dictionaries, so you shouldn't try to use them like a dictionary.
I think you need to use XPath. Like so:
from xml.etree import ElementTree as ET
tree = ET.parse("d.xml") # our xml file
root = tree.getroot()
s = root.findall("./Parent1/Parent2/Parent3/lastModifiedBy")
for i in s:
    print(i.text)
This gives you all lastModifiedBy elements in the Parent3 element. You can also select one by position (XPath positions start at 1), like this:
from xml.etree import ElementTree as ET
tree = ET.parse("d.xml") # our xml file
root = tree.getroot()
s = root.find("./Parent1/Parent2/Parent3/lastModifiedBy[1]") # first element with the "lastModifiedBy" tag
print(s.text)
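For the dictionary the question asks about, one option is to map each tag name to a list of texts, so that repeated tags such as lastModifiedBy are not overwritten. The following is a sketch under assumptions: `children_by_path` is a hypothetical helper, the XML is an abbreviated copy of the question's document, and the dotted path is assumed to start with the root tag.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Abbreviated copy of the document from the question.
XML = """<Body>
  <requestNo>1923</requestNo>
  <Parent1><Parent2><Parent3>
    <purpose>NeverMore</purpose>
    <lastModifiedBy>user2.thisUser</lastModifiedBy>
    <lastModifiedBy>MG</lastModifiedBy>
  </Parent3></Parent2></Parent1>
</Body>"""


def children_by_path(root, dotted_path):
    """Map each child tag under a dotted path to the list of its texts."""
    root_tag, *rest = dotted_path.split(".")
    if root_tag != root.tag:
        raise ValueError("path must start with the root tag %r" % root.tag)
    # "Body.Parent1.Parent2.Parent3" becomes "./Parent1/Parent2/Parent3"
    parent = root.find("./" + "/".join(rest)) if rest else root
    if parent is None:
        raise LookupError("no element at %r" % dotted_path)
    values = defaultdict(list)
    for child in parent:
        values[child.tag].append(child.text)
    return dict(values)


result = children_by_path(ET.fromstring(XML), "Body.Parent1.Parent2.Parent3")
print(result["lastModifiedBy"])
```

Storing lists costs an extra index when a tag occurs only once, but it avoids silently dropping the first lastModifiedBy value.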

Remove element in a XML file with Python

I'm a newbie with Python and I'd like to remove the element openingHours and the child elements from the XML.
I have this input
<Root>
<stations>
<station id= "1">
<name>whatever</name>
<openingHours>
<openingHour>
<entrance>main</entrance>
<timeInterval>
<from>05:30</from>
<to>21:30</to>
</timeInterval>
<openingHour/>
<openingHours>
<station/>
<station id= "2">
<name>foo</name>
<openingHours>
<openingHour>
<entrance>main</entrance>
<timeInterval>
<from>06:30</from>
<to>21:30</to>
</timeInterval>
<openingHour/>
<openingHours>
<station/>
<stations/>
<Root/>
I'd like this output
<Root>
<stations>
<station id= "1">
<name>whatever</name>
<station/>
<station id= "2">
<name>foo</name>
<station/>
<stations/>
<Root/>
So far I've tried this, from another thread (How to remove elements from XML using Python):
from lxml import etree
doc=etree.parse('stations.xml')
for elem in doc.xpath('//*[attribute::openingHour]'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
However, it doesn't seem to be working.
Thanks

I took your code for a spin, but at first the parser rejected the XML as you wrote it: the / in a closing tag has to come at the beginning (</...>), not at the end (<.../>).
That aside, your code isn't working because the XPath expression //*[attribute::openingHour] selects elements that have an attribute named openingHour, whereas you want the elements named openingHours. I got it to work by changing the expression to //openingHours, which makes the entire code:
from lxml import etree
doc=etree.parse('stations.xml')
for elem in doc.xpath('//openingHours'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))

You want to remove the tags <openingHours> and not some attribute with name openingHour:
from lxml import etree
doc = etree.parse('stations.xml')
for elem in doc.findall('.//openingHours'):
    parent = elem.getparent()
    parent.remove(elem)
print(etree.tostring(doc))
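If lxml is not available, a similar removal can be sketched with the standard library. Note that this swaps in xml.etree.ElementTree, whose elements have no getparent(), so the parents are selected with a path ending in "/..". The XML below is a shortened, well-formed version of the question's input.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the input from the question.
XML = """<Root><stations>
<station id="1"><name>whatever</name>
  <openingHours><openingHour><entrance>main</entrance></openingHour></openingHours>
</station>
<station id="2"><name>foo</name>
  <openingHours><openingHour><entrance>main</entrance></openingHour></openingHours>
</station>
</stations></Root>"""

root = ET.fromstring(XML)

# ".//openingHours/.." selects every element that has an openingHours child.
# findall() returns a list, so removing children inside the loop is safe.
for parent in root.findall(".//openingHours/.."):
    for child in parent.findall("openingHours"):
        parent.remove(child)

remaining_tags = [el.tag for el in root.iter()]
station_names = [el.text for el in root.iter("name")]
print(ET.tostring(root, encoding="unicode"))
```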

Python -lxml xpath returns empty list

I am reading an XLIFF file and planning to retrieve a specific element. I tried to print all the elements using
from lxml import etree
with open(r'path\to\file\.xliff', 'r', encoding='utf-8') as xml_file:
    tree = etree.parse(xml_file)
root = tree.getroot()
for element in root.iter():
    print("child", element)
The output was
child <Element {urn:oasis:names:tc:xliff:document:2.0}segment at 0x6b8f9c8>
child <Element {urn:oasis:names:tc:xliff:document:2.0}source at 0x6b8f908>
When I tried to get the specific element (with the help of many posts here) - source tag
segment = tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
it returns an empty list. Can someone tell me how to retrieve it properly?
Input :
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id = 1>
<source>
Hello world
</source>
</segment>
<segment id = 2 >
<source>
2nd statement
</source>
</segment>
</xliff>
I want to get the values of each segment and its corresponding source.

This code,
tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
is not accepted by lxml ("lxml.etree.XPathEvalError: Invalid expression"), because xpath() does not understand the {namespace}tag notation. You need to use findall().
The following works (in the XML sample, the segment elements are children of xliff):
from lxml import etree
tree = etree.parse("test.xliff") # XML in the question; ill-formed attributes corrected
segment = tree.findall('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
However, the real XML is apparently more complex (segment is not a direct child of xliff). Then you need to add .// to search the whole tree:
segment = tree.findall('.//{urn:oasis:names:tc:xliff:document:2.0}segment')
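To pair each segment with its source text, as the question asks, the findall() result can be walked once. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; the prefix x is an arbitrary choice, and the attributes in the sample are quoted to make the XML well-formed.

```python
import xml.etree.ElementTree as ET

# Sample from the question, with the id attributes quoted.
XLIFF = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
  <segment id="1">
    <source>
      Hello world
    </source>
  </segment>
  <segment id="2">
    <source>
      2nd statement
    </source>
  </segment>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}
root = ET.fromstring(XLIFF)

pairs = {}
for seg in root.findall(".//x:segment", NS):
    source = seg.find("x:source", NS)
    # strip() drops the line breaks and indentation around the text
    pairs[seg.get("id")] = source.text.strip()

print(pairs)
```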

Python xml etree find parent node by text of child

I have an XML that's like this
<xml>
<access>
<user>
<name>user1</name>
<group>testgroup</group>
</user>
<user>
<name>user2</name>
<group>testgroup</group>
</user>
</access>
</xml>
I now want to add a <group>testgroup2</group> to the user1 subtree.
Using the following I can get the name
access = root.find('access')
name = [element for element in access.iter() if element.text == 'user1']
But I can't access the parent using name.find('..'); it tells me:
AttributeError: 'list' object has no attribute 'find'.
Is there any possibility to access the exact <user> child of <access> where the text in name is "user1"?
Expected result:
<xml>
<access>
<user>
<name>user1</name>
<group>testgroup</group>
<group>testgroup2</group>
</user>
<user>
<name>user2</name>
<group>testgroup</group>
</user>
</access>
</xml>
Important notice: I can NOT use lxml and its getparent() method; I am stuck with xml.etree.

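To call find(), you need to iterate over the list, because name is a list of elements rather than a single element:
for ele in name:
    ele.find('..') # call find() on each element, not on the list
Note that in xml.etree, '..' cannot reach above the element that find() is called on, so this call returns None; the parent has to be looked up starting from an ancestor instead.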

Here is how I solved this, in case anyone is interested in doing this with xml.etree instead of lxml (for whatever reason).
It follows the suggestion from
http://effbot.org/zone/element.htm#accessing-parents
import xml.etree.ElementTree as et
tree = et.parse(my_xmlfile)
root = tree.getroot()
access = root.find('access')
# ... snip ...
def iterparent(tree):
    # yield a (parent, child) pair for every element below tree
    for parent in tree.iter():
        for child in parent:
            yield parent, child
# users = list of user names that need new_group added
# iterate over the (parent, child) tuples and find the user name;
# alter the XML tree when found
for user in users:
    print("processing user: %s" % user)
    for parent, child in iterparent(access):
        if child.tag == "name" and child.text == user:
            print("Name found: %s" % user)
            parent.append(et.fromstring('<group>%s</group>' % new_group))
After this, et.dump(tree) shows that the tree now contains the correctly altered user subtree with another group tag added.
Note: I am not really sure why this works. I assume that yield hands back references to the elements in the tree, so altering the yielded parent alters the original tree. My Python knowledge is not good enough to be sure about this, though; I just know that it works for me this way.

You can write a recursive method to iterate through the tree and capture the parents.
def recurse_tree(node):
    for child in node: # iterating over an Element yields its direct children
        if child.text == 'user1':
            yield node
        for subchild in recurse_tree(child):
            yield subchild
print(list(recurse_tree(root)))
# [<Element 'user' at 0x18a1470>]
In Python 3.3 and later, you can use the yield from ... syntax rather than looping over the recursive call.
Note that this could possibly yield the same node more than once (if there are multiple children containing the target text). You can either use a set to remove duplicates, or you can alter the control flow to prevent this from happening.

You can use the findall() method directly to get the parent node whose name child equals 'user1'. See the code below:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml') #build tree object using your xml
root = tree.getroot() #using tree object get the root
for parent in root.findall(".//*[name='user1']"):
    # the predicate [name='user1'] after the asterisk selects
    # all elements that have a child <name> with the text 'user1'
    parent.append(ET.fromstring("<group>testgroup2</group>"))
# if you want to see the xml after adding the string
ET.dump(root)
# optionally to save the xml
tree.write('output.xml')
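Another common way to work around the missing getparent() in xml.etree is to build a child-to-parent mapping once and look parents up in it. The sketch below assumes the well-formed version of the XML from the question; `parent_map` and `groups` are illustrative names only.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the XML from the question.
XML = """<xml><access>
<user><name>user1</name><group>testgroup</group></user>
<user><name>user2</name><group>testgroup</group></user>
</access></xml>"""

root = ET.fromstring(XML)

# Map every element to its parent; the root itself has no entry.
parent_map = {child: parent for parent in root.iter() for child in parent}

# list() takes a snapshot so the tree can be modified inside the loop.
for name in list(root.iter("name")):
    if name.text == "user1":
        user = parent_map[name]
        ET.SubElement(user, "group").text = "testgroup2"

groups = {
    user.findtext("name"): [g.text for g in user.findall("group")]
    for user in root.iter("user")
}
print(groups)
```

The mapping is a snapshot: if elements are moved or added afterwards, it has to be rebuilt before further lookups.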
