get attribute from xml-node with specific value - python

I have an XSD-file where I need to get a namespace as defined in the root-tag:
<schema xmlns="http://www.w3.org/2001/XMLSchema" xmlns:abw="http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0" xmlns:adv="http://www.adv-online.de/namespaces/adv/gid/6.0" xmlns:bfm="http://www.liegenschaftsbestandsmodell.de/ns/bfm/1.0" xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:sc="http://www.interactive-instruments.de/ShapeChange/AppInfo" elementFormDefault="qualified" targetNamespace="http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0" version="1.0.1.0">
<!-- elements -->
</schema>
Since the targetNamespace of this schema definition is "http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0", I need to get the short identifier (prefix) for this namespace, which is abw. To get this identifier I have to find the attribute on the root tag that has exactly the same value as my targetNamespace (I can't rely on the identifier already being part of the targetNamespace string; this may change in the future).
In the question "How to extract xml attribute using Python ElementTree" I found how to get the value of an attribute given its name. However, I don't know the attribute's name, only its value. What can I do when I have a value and want to select the attribute that has this value?
I think of something like this:
for key in root.attrib.keys():
    if(root.attrib[key] == targetNamespace):
        return root.attrib[key]
but root.attrib only contains elementFormDefault, targetNamespace and version, but not xmlns:abw.

The string passed to StringIO must be Unicode, otherwise this error appears:
Traceback (most recent call last):
  File "<pyshell#62>", line 1, in <module>
    it = etree.iterparse(StringIO(xml))
TypeError: initial_value must be unicode or None, not str
code:
>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> xml=u"""<schema xmlns="http://www.w3.org/2001/XMLSchema" xmlns:abw="http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0" xmlns:adv="http://www.adv-online.de/namespaces/adv/gid/6.0" xmlns:bfm="http://www.liegenschaftsbestandsmodell.de/ns/bfm/1.0" xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:sc="http://www.interactive-instruments.de/ShapeChange/AppInfo" elementFormDefault="qualified" targetNamespace="http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0" version="1.0.1.0">
<!-- elements -->
</schema>"""
>>> ns = dict([
    node for _, node in ElementTree.iterparse(
        StringIO(xml), events=['start-ns']
    )
])
>>> for k,v in ns.iteritems():
    if v=='http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0':
        print k
output:
abw
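On Python 3 the same idea can be written without iteritems or the print statement. The following is only a minimal sketch, not the answerer's code: the function name prefix_for_target_namespace is made up for illustration, the sample schema is a shortened copy of the one in the question, and the default namespace (empty prefix) is skipped on the assumption that only a named prefix is wanted.

```python
from io import BytesIO
from xml.etree import ElementTree

# Shortened copy of the schema from the question.
XSD = b"""<schema xmlns="http://www.w3.org/2001/XMLSchema"
    xmlns:abw="http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0"
    xmlns:gml="http://www.opengis.net/gml/3.2"
    elementFormDefault="qualified"
    targetNamespace="http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0"
    version="1.0.1.0">
</schema>"""


def prefix_for_target_namespace(source):
    """Return the prefix bound to the root's targetNamespace, or None.

    Hypothetical helper: collects the namespace declarations reported
    before the root element starts, then looks up the targetNamespace.
    """
    prefixes = {}  # namespace URI -> prefix
    target = None
    events = ("start-ns", "start")
    for event, payload in ElementTree.iterparse(source, events=events):
        if event == "start-ns":
            prefix, uri = payload
            if prefix:  # assumption: ignore the default namespace
                prefixes[uri] = prefix
        else:
            # First "start" event is the root element; stop reading here.
            target = payload.get("targetNamespace")
            break
    return prefixes.get(target)


prefix = prefix_for_target_namespace(BytesIO(XSD))
print(prefix)
```

The start-ns events for an element are reported before that element's start event, which is why the loop can stop at the first start event.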

Using minidom instead of ETree did it:
import xml.dom.minidom as DOM
tree = DOM.parse(myFile)
root = tree.documentElement
targetNamespace = root.getAttribute("targetNamespace")
d = dict(root.attributes.items())
for key in d:
    if d[key] == targetNamespace: return key
This will return either targetNamespace or xmlns:abw, depending on which comes first in the XSD. Of course the first case should be ignored, but that is beyond the scope of this question.
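One way to ignore the targetNamespace attribute itself is to look only at attributes whose name starts with xmlns:. This is a minimal sketch, assuming the schema is available as a string; target_prefix is a hypothetical helper name and the sample schema is a shortened stand-in for the real file.

```python
import xml.dom.minidom as DOM

# Shortened sample; targetNamespace deliberately comes first.
XSD_TEXT = """<schema targetNamespace="http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0"
    xmlns="http://www.w3.org/2001/XMLSchema"
    xmlns:abw="http://www.liegenschaftsbestandsmodell.de/ns/abw/1.0.1.0"
    version="1.0.1.0">
</schema>"""


def target_prefix(xsd_text):
    """Return the xmlns prefix whose URI equals targetNamespace, or None."""
    root = DOM.parseString(xsd_text).documentElement
    target = root.getAttribute("targetNamespace")
    for name, value in root.attributes.items():
        # Only namespace declarations with a prefix are candidates.
        if name.startswith("xmlns:") and value == target:
            return name[len("xmlns:"):]
    return None


print(target_prefix(XSD_TEXT))
```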

Related

Get XPath to attribute

I want to get the actual XPath expression to an attribute node for a specific attribute in an xml element tree (using lxml).
Suppose the following XML tree.
<foo>
<bar attrib_name="hello_world"/>
</foo>
The XPath expression "//@*[local-name() = "attrib_name"]" produces ['hello_world'], which is the value of the matching attribute, and "//@*[local-name() = "attrib_name"]/.." gets me the bar element, which is one level too high. I need the XPath expression to the specific attribute node itself, not its parent element: given the string 'attrib_name', I want to generate '/foo/bar/@attrib_name'.
from lxml import etree
from io import StringIO
f = StringIO('<foo><bar attrib_name="hello_world"></bar></foo>')
tree = etree.parse(f)
print(tree.xpath('//@*[local-name() = "attrib_name"]'))
# --> ['hello_world']
print([tree.getpath(el) for el in tree.xpath('//@*[local-name() = "attrib_name"]/..')])
# --> ['/foo/bar']
As an add-on this should work with namespaces too.
If you remove the /.., the query returns _ElementUnicodeResult objects, which keep a reference to their parent element (getparent()) and to the attribute name (attrname).
This allows you to append the attribute name to the XPath of the parent:
>>> print(['%s/@%s' % (tree.getpath(attrib_result.getparent()), attrib_result.attrname) for attrib_result in tree.xpath('//@*[local-name() = "attrib_name"]')])
['/foo/bar/@attrib_name']
Trying to apply that to namespaces will result in the namespace added to the xpath (which may not be what you want):
>>> tree = etree.parse(StringIO('<foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><bar xsi:attrib_name="hello_world"></bar></foo>'))
>>> print(['%s/@%s' % (tree.getpath(attrib_result.getparent()), attrib_result.attrname) for attrib_result in tree.xpath('//@*[local-name() = "attrib_name"]')])
['/foo/bar/@{http://www.w3.org/2001/XMLSchema-instance}attrib_name']
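If a prefixed path is preferred over the {uri}name form, the prefix can be looked up in the parent's nsmap. The following is a sketch under the assumption that the prefix is declared on the parent or one of its ancestors; attribute_paths is a hypothetical helper, not part of lxml.

```python
from io import BytesIO
from lxml import etree


def attribute_paths(tree, local_name):
    """Return path strings for every attribute with the given local name."""
    paths = []
    # $name is an XPath variable, filled from the keyword argument.
    for result in tree.xpath('//@*[local-name() = $name]', name=local_name):
        parent = result.getparent()
        qname = etree.QName(result.attrname)
        attr = qname.localname
        if qname.namespace is not None:
            # nsmap maps prefix -> URI for all declarations in scope.
            for prefix, uri in parent.nsmap.items():
                if uri == qname.namespace and prefix is not None:
                    attr = '%s:%s' % (prefix, qname.localname)
                    break
        paths.append('%s/@%s' % (tree.getpath(parent), attr))
    return paths


tree = etree.parse(BytesIO(
    b'<foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    b'<bar xsi:attrib_name="hello_world"/></foo>'))
paths = attribute_paths(tree, 'attrib_name')
print(paths)
```

The resulting path only evaluates correctly if the same prefix mapping is passed via namespaces= when it is used in a later xpath() call.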

Parsing XML in Python using the cElementTree module

I have an XML file, which I wanted to convert to a dictionary. I have tried to write the following code but the output is not as expected. I have the following XML file named core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdfs/tmp</value>
<description>Temporary Directory.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.XXX.X.XXX:XXXX</value>
<description>Use HDFS as file storage engine</description>
</property>
</configuration>
The code that I wrote is:
import xml.etree.cElementTree
import xml.etree.ElementTree as ET
import warnings
warnings.filterwarnings("ignore")
class XmlListConfig(list):
    def __init__(self, aList):
        for element in aList:
            if element:
                # treat like dict
                if len(element) == 1 or element[0].tag != element[1].tag:
                    self.append(XmlDictConfig(element))
                # treat like list
                elif element[0].tag == element[1].tag:
                    self.append(XmlListConfig(element))
            elif element.text:
                text = element.text.strip()
                if text:
                    self.append(text)
class XmlDictConfig(dict):
    def __init__(self, parent_element):
        if parent_element.items():
            self.update(dict(parent_element.items()))
        for element in parent_element:
            if element:
                # treat like dict - we assume that if the first two tags
                # in a series are different, then they are all different.
                if len(element) == 1 or element[0].tag != element[1].tag:
                    aDict = XmlDictConfig(element)
                # treat like list - we assume that if the first two tags
                # in a series are the same, then the rest are the same.
                else:
                    # here, we put the list in dictionary; the key is the
                    # tag name the list elements all share in common, and
                    # the value is the list itself
                    aDict = {element[0].tag: XmlListConfig(element)}
                # if the tag has attributes, add those to the dict
                if element.items():
                    aDict.update(dict(element.items()))
                self.update({element.tag: aDict})
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
            elif element.items():
                self.update({element.tag: dict(element.items())})
            # finally, if there are no child tags and no attributes, extract
            # the text
            else:
                self.update({element.tag: element.text})
tree = ET.parse('core-site.xml')
root = tree.getroot()
xmldict = XmlDictConfig(root)
print xmldict
This is the output that I am getting:
{
    'property':
    {
        'name': 'fs.defaultFS',
        'value': 'hdfs://192.X.X.X:XXXX',
        'description': 'Use HDFS as file storage engine'
    }
}
Why isn't the first property tag being shown? It only shows the data in the last property tag.
Since you are using a dict, the second element with the same key (property) replaces the first element already stored in the dict.
You have to use a different data structure, for instance a list of dicts.
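A list of dicts could be built directly from the property elements. This is a minimal sketch rather than a fix of the classes above: properties_as_list is a hypothetical helper, and the XML is the core-site.xml content from the question embedded as a string.

```python
import xml.etree.ElementTree as ET

# Same content as core-site.xml in the question.
CORE_SITE = """<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdfs/tmp</value>
<description>Temporary Directory.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.XXX.X.XXX:XXXX</value>
<description>Use HDFS as file storage engine</description>
</property>
</configuration>"""


def properties_as_list(root):
    """Return one dict per <property>, mapping child tag to its text."""
    return [
        {child.tag: child.text for child in prop}
        for prop in root.findall("property")
    ]


properties = properties_as_list(ET.fromstring(CORE_SITE))
for prop in properties:
    print(prop)
```

Because each property gets its own dict inside a list, the repeated property tag no longer overwrites earlier entries.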

Filtering XML in Python

I need to write a filter to discard some elements, tags and blocks in my XML files. Below are my XML examples and expected outputs. I am somewhat confused about the differences between element, tag and attribute in ElementTree. My test does not work!
Filter:
import xml.etree.ElementTree as xee
def test(input):
    doc=xee.fromstring(input)
    print xee.tostring(doc)
    #RemoveTimeStampAttribute
    for elem in doc.findall('Component'):
        if 'timeStamp' in elem.attrib:
            del elem.attrib['timeStamp']
    #RemoveTimeStampElements
    for elem in doc.findall('TimeStamp'):
        del elem
    print xee.tostring(doc)
    return xee.tostring(doc)
First of all, you are removing the attribute incorrectly. Check whether timeStamp is in the element's attrib dictionary and then use del to remove it:
def amdfilter(input):
    doc = xee.fromstring(input)
    for node in doc.findall('Component'):
        if 'timeStamp' in node.attrib:
            del node.attrib['timeStamp']
    return xee.tostring(doc)
Also, since you are testing only the attribute removal here, change your expectation to:
expected = '<ComponentMain><Component /></ComponentMain>'
Complete test (it passes):
import unittest
from amdfilter import *
class FilterTest(unittest.TestCase):
    def testRemoveTimeStampAttribute(self):
        input = '<?xml version="1.0"?><ComponentMain><Component timeStamp="2014"></Component></ComponentMain>'
        output = amdfilter(input)
        expected = '<ComponentMain><Component /></ComponentMain>'
        self.assertEqual(expected, output)
Note that I don't care here about the xml declaration line (it could be easily added).
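The question also tried to drop TimeStamp elements with del elem, which only unbinds the loop variable and leaves the tree unchanged; an element has to be removed through its parent. The following is a Python 3 sketch under assumptions: strip_timestamps is a hypothetical name, the Name element in the sample is invented for illustration, and encoding="unicode" is used so that tostring returns str rather than bytes.

```python
import xml.etree.ElementTree as ET


def strip_timestamps(xml_text):
    """Remove timeStamp attributes and TimeStamp child elements."""
    doc = ET.fromstring(xml_text)
    # Drop the attribute wherever a Component carries it.
    for node in doc.iter("Component"):
        node.attrib.pop("timeStamp", None)
    # Elements can only be removed via their parent, so walk all
    # potential parents; list() copies guard against mutating while iterating.
    for parent in list(doc.iter()):
        for child in list(parent):
            if child.tag == "TimeStamp":
                parent.remove(child)
    return ET.tostring(doc, encoding="unicode")


sample = ('<ComponentMain><Component timeStamp="2014">'
          '<TimeStamp>1</TimeStamp><Name>a</Name>'
          '</Component></ComponentMain>')
print(strip_timestamps(sample))
```

Note that remove() also discards the removed element's tail text, which matters only if the document has meaningful text between elements.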

XPath - Return ALL nodes with certain string pattern

Here is a sample from the doc I am working with:
<idx:index xsi:schemaLocation="http://www.belscript.org/schema/index index.xsd" idx:belframework_version="2.0">
<idx:namespaces>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/entrez-gene-ids-hmr.belns"/>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/hgnc-approved-symbols.belns"/>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/mgi-approved-symbols.belns"/>
I can get all nodes with name "namespace" with the following code:
tree = etree.parse(self.old_files)
urls = tree.xpath('//*[local-name()="namespace"]')
This would return a list of the 3 namespace elements. But what if I want to get to the data in the idx:resourceLocation attribute? Here is my attempt at doing that, using the XPath docs as a guide.
urls = tree.xpath('//*[local-name()="namespace"]/@idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/"',
                  namespaces={'idx' : 'http://www.belscript.org/schema/index'})
What I want is all nodes that have an attribute starting with http://resource.belframework.org/belframework/1.0/namespace. So in the sample doc, it would return me only those strings in the resourceLocation attribute. Unfortunately, the syntax is not quite right, and I am having trouble deriving the proper syntax from the documentation. Thank you!
I think what you are looking for is:
//*[local-name()="namespace"]/@idx:resourceLocation
or
//idx:namespace/@idx:resourceLocation
or, if you want only those @idx:resourceLocation attributes that start with "http://resource.belframework.org/belframework/1.0/namespace", you could use
'''//idx:namespace[
    starts-with(@idx:resourceLocation,
        "http://resource.belframework.org/belframework/1.0/namespace")]
    /@idx:resourceLocation'''
import lxml.etree as ET
content = '''\
<root xmlns:xsi="http://www.xxx.com/zzz/yyy" xmlns:idx="http://www.belscript.org/schema/index">
<idx:index xsi:schemaLocation="http://www.belscript.org/schema/index index.xsd" idx:belframework_version="2.0">
<idx:namespaces>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/entrez-gene-ids-hmr.belns"/>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/hgnc-approved-symbols.belns"/>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/mgi-approved-symbols.belns"/>
</idx:namespaces>
</idx:index>
</root>
'''
root = ET.XML(content)
namespaces = {'xsi': 'http://www.xxx.com/zzz/yyy',
              'idx': 'http://www.belscript.org/schema/index'}
for item in root.xpath(
        '//*[local-name()="namespace"]/@idx:resourceLocation', namespaces=namespaces):
    print(item)
yields
http://resource.belframework.org/belframework/1.0/namespace/entrez-gene-ids-hmr.belns
http://resource.belframework.org/belframework/1.0/namespace/hgnc-approved-symbols.belns
http://resource.belframework.org/belframework/1.0/namespace/mgi-approved-symbols.belns
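The starts-with variant is not exercised by the code above, so here is a small sketch of it. The second idx:namespace entry with an example.org URL is invented purely to show that non-matching attributes are filtered out, and the URL prefix is passed as an XPath variable ($prefix) instead of being spliced into the expression.

```python
import lxml.etree as ET

content = b'''<root xmlns:idx="http://www.belscript.org/schema/index">
  <idx:namespaces>
    <idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/hgnc-approved-symbols.belns"/>
    <idx:namespace idx:resourceLocation="http://example.org/other.belns"/>
  </idx:namespaces>
</root>'''

root = ET.XML(content)
namespaces = {'idx': 'http://www.belscript.org/schema/index'}
wanted = 'http://resource.belframework.org/belframework/1.0/namespace'

# Select the attribute nodes themselves, keeping only those whose
# value starts with the wanted URL prefix.
matches = root.xpath(
    '//idx:namespace/@idx:resourceLocation[starts-with(., $prefix)]',
    namespaces=namespaces, prefix=wanted)

for item in matches:
    print(item)
```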

Python - lxml - how to 'move' around the tree when building the tree

Basic question - how do you 'move' around in a tree when you are building a tree.
I can populate the first level:
import lxml.etree as ET
def main():
    root = ET.Element('baseURL')
    root.attrib["URL"]='www.com'
    root.attrib["title"]='Level Title'
    myList = [["www.1.com","site 1 Title"],["www.2.com","site 2 Title"],["www.3.com","site 3 Title"]]
    for i in xrange(len(myList)):
        ET.SubElement(root, "link_"+str(i), URL=myList[i][0], title=myList[i][1])
This gives me something like:
baseURL:
    link_0
    link_1
    link_2
from there, I want to add a subtree from each of the new nodes so it looks something like:
baseURL:
    link_0:
        link_A
        link_B
        link_C
    link_1
    link_2
I can't see how to 'point' the subElement call to the next node down - I tried:
myList2 = [["www.A.com","site A Title"],["www.B.com","site B Title"],["www.C.com","site C Title"]]
for i in xrange(len(myList2)):
    ET.SubElement('link_0', "link_"+str(i), URL=myList2[i][0], title=myList2[i][1])
But that throws the error:
TypeError: Argument '_parent' has incorrect type (expected lxml.etree._Element, got str)
as I am giving the SubElement call a string, not an element reference. I also tried it as a variable (i.e. link_0 rather than "link_0"), and that fails because no such global variable is defined, so my reference is obviously incorrect.
How do I 'point' my lxml builder to a child as a parent, and write a new child?
ET.SubElement(parent_node,type) creates a new XML element node as a child of parent_node. It also returns this new node.
So you could do this:
import lxml.etree as ET
def main():
    root = ET.Element('baseURL')
    myList = [1,2,3]
    children = []
    for x in myList:
        children.append( ET.SubElement(root, "link_"+str(x)) )
    for y in myList:
        ET.SubElement( children[0], "child_"+str(y) )
But keeping track of the children is probably excessive since lxml already provides you with many ways to get to them.
Here's a way using lxml's built-in child lists:
node = root[0]
for y in myList:
    ET.SubElement( node, "child_"+str(y) )
Here's a way using XPath (possibly better if your XML is getting ugly)
node = root.xpath("/baseURL/link_0")[0]
for y in myList:
    ET.SubElement( node, "child_"+str(y) )
Found the answer: I should be using Python list-style indexing, root[n], rather than trying to reach the node via link_0.
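Putting the pieces together, the two-level tree from the question can be built by keeping the elements that SubElement returns. This sketch swaps in the standard library's xml.etree.ElementTree so it runs without lxml; the Element and SubElement calls have the same shape as the lxml ones used above, and the shortened site lists are illustrative.

```python
import xml.etree.ElementTree as ET

sites = [("www.1.com", "site 1 Title"), ("www.2.com", "site 2 Title")]
sub_sites = [("www.A.com", "site A Title"), ("www.B.com", "site B Title")]

root = ET.Element("baseURL", URL="www.com", title="Level Title")

# SubElement returns the new child, so keep the references around.
links = [
    ET.SubElement(root, "link_%d" % i, URL=url, title=title)
    for i, (url, title) in enumerate(sites)
]

# Use an element reference (not the tag name string) as the parent.
for letter, (url, title) in zip("AB", sub_sites):
    ET.SubElement(links[0], "link_" + letter, URL=url, title=title)

print(ET.tostring(root, encoding="unicode"))
```

Here links[0] and root[0] refer to the same element, so either can serve as the parent for the second level.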
