Python parse standalone-full.xml from Wildfly - python

I'm trying to parse the standalone-full.xml from Wildfly 8.1 Final with python to extract some information as datasources.
The example XML below.
<?xml version="1.0" ?>
<server xmlns="urn:jboss:domain:2.1">
<profile>
<subsystem xmlns="urn:jboss:domain:datasources:2.0">
<datasources>
<datasource jndi-name="java:jboss/datasources/JNDI" pool-name="JNDI" enabled="true">
<connection-url>jdbc:oracle:thin:#//HOST</connection-url>
<driver>ojdbc6</driver>
<pool>
<min-pool-size>50</min-pool-size>
<max-pool-size>100</max-pool-size>
</pool>
<security>
<user-name>USER</user-name>
<password>USER</password>
</security>
<validation>
<valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleValidConnectionChecker"/>
<validate-on-match>false</validate-on-match>
<background-validation>true</background-validation>
<background-validation-millis>10000</background-validation-millis>
<exception-sorter class-name="org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter"/>
</validation>
</datasource>
<drivers>
<driver name="h2" module="com.h2database.h2">
<xa-datasource-class>org.h2.jdbcx.JdbcDataSource</xa-datasource-class>
</driver>
<driver name="ojdbc6" module="oracle.ojdbc">
<xa-datasource-class>oracle.ojdbc.xa.client.OracleXADataSource</xa-datasource-class>
</driver>
</drivers>
</datasources>
</subsystem>
</profile>
EDIT: How can I get deeper in the tree?
I tried something like this:
In[16]: from lxml import etree
In[18]: xml = etree.parse('standalone-full.xml')
In[21]: root = xml.getroot()
In[28]: children = root[0].getchildren()
In[31]: children[0]
Out[31]: <Element {urn:jboss:domain:datasources:2.0}subsystem at 0x4bef208>
In[32]: datasources = children[0]
In[33]: datasources.getchildren()
Out[33]: [<Element {urn:jboss:domain:datasources:2.0}datasources at 0x4befa48>]

Your question is rather unspecific, but as far as I can see from the regex you posted, you want to grab the text values of the connection-url, user-name, and password nodes under each datasource node that has a pool-name attribute with a value of JNDI. Here is one possibility of doing that (tested under Python 2.7):
import xml.etree.cElementTree as ET
ns = {'ds': 'urn:jboss:domain:datasources:2.0'}
root = ET.parse('standalone-full.xml').getroot()
children = root.findall(".//ds:datasource[#pool-name='JNDI']", ns)
for child in children:
print child.find("ds:connection-url", ns).text
security = child.find("ds:security", ns)
print security.find("ds:user-name", ns).text
print security.find("ds:password", ns).text

You could use Augeas to parse it:
$ augtool -At "Xml.lns incl $PWD/standalone-full.xml"
augtool> get //standalone-full.xml//datasource//password/#text
//standalone-full.xml//datasource//password/#text = USER
Just use the python-augeas bindings with Python:
import augeas
a = augeas.Augeas(flags=augeas.Augeas.NO_MODL_AUTOLOAD)
a.transform("Xml", "/home/raphink/bas/augeas/standalone-full.xml")
a.load()
v = a.get("//standalone-full.xml//datasource//password/#text")

I've solved my problem with regex which is a bad idea but it works.
import re
data = "standalone-full.xml"
regex_result = re.findall(r'.*:domain:datasources[\S\s]*?pool-name="JNDI"[\S\s]*?connection-url>.*' +
'#//(.*)<.*[\S\s]*?user-name>(.*)<.*\s*<password>(.*)<', data, re.M)

Related

Iterating through xml file

I am trying to get all surnames from xml file, but if I am trying to use find, It throws an exception
TypeError: 'NoneType' object is not iterable
This is my code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for elem in root:
for subelem in elem:
for subsubelem in subelem.find('surname'):
print(subsubelem.text)
When I remove the find('surname') from code, It returning all texts from subsubelements.
This is xml:
<?xml version="1.0" encoding="UTF-8"?>
<pp:card xmlns:pp="http://xmlns.page.com/path/subpath">
<pp:id>1</pp:id>
<pp:customers>
<pp:customer>
<pp:name>John</pp:name>
<pp:surname>Walker</pp:surname>
<pp:adress>
<pp:street>Walker street</pp:street>
<pp:number>1/1</pp:number>
<pp:state>England</pp:state>
</pp:adress>
<pp:created>2021-03-08Z</pp:created>
</pp:customer>
<pp:customer>
<pp:name>Michael</pp:name>
<pp:surname>Jordan</pp:surname>
<pp:adress>
<pp:street>Jordan street</pp:street>
<pp:number>28</pp:number>
<pp:state>USA</pp:state>
</pp:adress>
<pp:created>2021-03-09Z</pp:created>
</pp:customer>
</pp:customers>
</pp:card>
How should I fix it?
Not really a python person, but should the "find" statement include the "pp:" in its search, such as,
find('pp:surname')
Neither the opening nor closing tags actually match "surname".
Use the namespace when you call findall
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<pp:card xmlns:pp="http://xmlns.page.com/path/subpath">
<pp:id>1</pp:id>
<pp:customers>
<pp:customer>
<pp:name>John</pp:name>
<pp:surname>Walker</pp:surname>
<pp:adress>
<pp:street>Walker street</pp:street>
<pp:number>1/1</pp:number>
<pp:state>England</pp:state>
</pp:adress>
<pp:created>2021-03-08Z</pp:created>
</pp:customer>
<pp:customer>
<pp:name>Michael</pp:name>
<pp:surname>Jordan</pp:surname>
<pp:adress>
<pp:street>Jordan street</pp:street>
<pp:number>28</pp:number>
<pp:state>USA</pp:state>
</pp:adress>
<pp:created>2021-03-09Z</pp:created>
</pp:customer>
</pp:customers>
</pp:card>'''
ns = {'pp': 'http://xmlns.page.com/path/subpath'}
root = ET.fromstring(xml)
names = [sn.text for sn in root.findall('.//pp:surname', ns)]
print(names)
output
['Walker', 'Jordan']

lxml non-recursive full tag

Given the following xml:
<node a='1' b='1'>
<subnode x='25'/>
</node>
I would like to extract the tagname and all attributes for the first node, i.e., the verbatim code:
<node a='1' b='1'>
without the subnode.
For example in Python, tostring returns too much:
from lxml import etree
root = etree.fromstring("<node a='1' b='1'><subnode x='25'>some text</subnode></node>")
print(etree.tostring(root))
returns
b'<node a="1" b="1"><subnode x="25">some text</subnode></node>'
The following gives the desired result, but is much too verbose:
tag = root.tag
for att, val in root.attrib.items():
tag += ' '+att+'="'+val+'"'
tag = '<'+tag+'>'
print(tag)
result:
<node a="1" b="1">
What is an easier (and guaranteed attribute order preserving) way of doing this?
You can remove all of the subnodes.
from lxml import etree
root = etree.fromstring("<node a='1' b='1'><subnode x='25'>some text</subnode></node>")
for subnode in root.xpath("//subnode"):
subnode.getparent().remove(subnode)
etree.tostring(root) # '<node a="1" b="1"/>'
Alternatively, you can use a simple regex. Order is guaranteed.
import re
res = re.search('<(.*?)>', etree.tostring(root))
res.group(1) # "node a='1' b='1'"

Parsing XML with ElementTree in Python

I have XML like this:
<parameter>
<name>ec_num</name>
<value>none</value>
<units/>
<url/>
<id>2455</id>
<m_date>2008-11-29 13:15:14</m_date>
<user_id>24</user_id>
<user_name>registry</user_name>
</parameter>
<parameter>
<name>swisspro</name>
<value>Q8H6N2</value>
<units/>
I want to parse the XML and extract the <value> entry which is just below the <name> entry marked 'swisspro'. I.e. I want to parse and extract the 'Q8H6N2' value.
How would I do this using ElementTree?
It would by much easier to do via lxml, but here' a solution using ElementTree library:
import xml.etree.ElementTree as ET
data = """<parameters>
<parameter>
<name>ec_num</name>
<value>none</value>
<units/>
<url/>
<id>2455</id>
<m_date>2008-11-29 13:15:14</m_date>
<user_id>24</user_id>
<user_name>registry</user_name>
</parameter>
<parameter>
<name>swisspro</name>
<value>Q8H6N2</value>
<units/>
</parameter>
</parameters>"""
tree = ET.fromstring(data)
for parameter in tree.iter(tag='parameter'):
name = parameter.find('name')
if name is not None and name.text == 'swisspro':
print parameter.find('value').text
break
prints:
Q8H6N2
The idea is pretty simple: iterate over all parameter tags, check the value of the name tag and if it is equal to swisspro, get the value element.
Hope that helps.
Here is an example:
xml file
<span style="font-size:13px;"><?xml version="1.0" encoding="utf-8"?>
<root>
<person age="18">
<name>hzj</name>
<sex>man</sex>
</person>
<person age="19" des="hello">
<name>kiki</name>
<sex>female</sex>
</person>
</root></span>
parse method
from xml.etree import ElementTree
def print_node(node):
'''print basic info'''
print "=============================================="
print "node.attrib:%s" % node.attrib
if node.attrib.has_key("age") > 0 :
print "node.attrib['age']:%s" % node.attrib['age']
print "node.tag:%s" % node.tag
print "node.text:%s" % node.text
def read_xml(text):
'''read xml file'''
# root = ElementTree.parse(r"D:/test.xml") #first method
root = ElementTree.fromstring(text) #second method
# get element
# 1 by getiterator
lst_node = root.getiterator("person")
for node in lst_node:
print_node(node)
# 2 by getchildren
lst_node_child = lst_node[0].getchildren()[0]
print_node(lst_node_child)
# 3 by .find
node_find = root.find('person')
print_node(node_find)
#4. by findall
node_findall = root.findall("person/name")[1]
print_node(node_findall)
if __name__ == '__main__':
read_xml(open("test.xml").read())

Reading Maven Pom xml in Python

I have a pom file that has the following defined:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.welsh</groupId>
<artifactId>my-site</artifactId>
<version>1.0.0</version>
<packaging>pom</packaging>
<profiles>
<profile>
<build>
<plugins>
<plugin>
<groupId>org.welsh.utils</groupId>
<artifactId>site-tool</artifactId>
<version>1.0</version>
<executions>
<execution>
<configuration>
<mappings>
<property>
<name>homepage</name>
<value>/content/homepage</value>
</property>
<property>
<name>assets</name>
<value>/content/assets</value>
</property>
</mappings>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
</project>
And I am looking to build a dictionary off the name & value elements under property under the mappings element.
So what I'm trying to figure out how to get all possible mappings elements (Incase of multiple build profiles) so I can get all property elements under it and from reading about Supported XPath syntax the following should print out all possible text/value elements:
import xml.etree.ElementTree as xml
pomFile = xml.parse('pom.xml')
root = pomFile.getroot()
for mapping in root.findall('*/mappings'):
for prop in mapping.findall('.//property'):
logging.info(prop.find('name').text + " => " + prop.find('value').text)
Which is returning nothing. I tried just printing out all the mappings elements and get:
>>> print root.findall('*/mappings')
[]
And when I print out the everything from root I get:
>>> print root.findall('*')
[<Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x10b38bd50>, <Element '{http://maven.apache.org/POM/4.0.0}groupId' at 0x10b38bd90>, <Element '{http://maven.apache.org/POM/4.0.0}artifactId' at 0x10b38bf10>, <Element '{http://maven.apache.org/POM/4.0.0}version' at 0x10b3900d0>, <Element '{http://maven.apache.org/POM/4.0.0}packaging' at 0x10b390110>, <Element '{http://maven.apache.org/POM/4.0.0}name' at 0x10b390150>, <Element '{http://maven.apache.org/POM/4.0.0}properties' at 0x10b390190>, <Element '{http://maven.apache.org/POM/4.0.0}build' at 0x10b390310>, <Element '{http://maven.apache.org/POM/4.0.0}profiles' at 0x10b390390>]
Which made me try to print:
>>> print root.findall('*/{http://maven.apache.org/POM/4.0.0}mappings')
[]
But that's not working either.
Any suggestions would be great.
Thanks,
The main issues of the code in the question are
that it doesn't specify namespaces, and
that it uses */ instead of // which only matches direct children.
As you can see at the top of the XML file, Maven uses the namespace http://maven.apache.org/POM/4.0.0. The attribute xmlns in the root node defines the default namespace. The attribute xmlns:xsi defines a namespace that is only used for xsi:schemaLocation.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
To specify tags like profile in methods like find, you have to specify the namespace as well. For example, you could write the following to find all profile-tags.
import xml.etree as xml
pom = xml.parse('pom.xml')
for profile in pom.findall('//{http://maven.apache.org/POM/4.0.0}profile'):
print(repr(profile))
Also note that I'm using //. Using */ would have the same result for your specific xml file above. However, it would not work for other tags like mappings. Since * represents only one level, */child can be expanded to parent/tag or xyz/tag but not to xyz/parent/tag.
Now, you should be able to come up with something like this to find all mappings:
pom = xml.parse('pom.xml')
map = {}
for mapping in pom.findall('//{http://maven.apache.org/POM/4.0.0}mappings'
'/{http://maven.apache.org/POM/4.0.0}property'):
name = mapping.find('{http://maven.apache.org/POM/4.0.0}name').text
value = mapping.find('{http://maven.apache.org/POM/4.0.0}value').text
map[name] = value
Specifying the namespaces like this is quite verbose. To make it easier to read, you can define a namespace map and pass it as second argument to find and findall:
# ...
nsmap = {'m': 'http://maven.apache.org/POM/4.0.0'}
for mapping in pom.findall('//m:mappings/m:property', nsmap):
name = mapping.find('m:name', nsmap).text
value = mapping.find('m:value', nsmap).text
map[name] = value
Ok, found out that when I remove maven stuff from the project element so its just <project> I can do this:
for mapping in root.findall('*//mappings'):
logging.info(mapping)
for prop in mapping.findall('./property'):
logging.info(prop.find('name').text + " => " + prop.find('value').text)
Which would result in:
INFO:root:<Element 'mappings' at 0x10d72d350>
INFO:root:homepage => /content/homepage
INFO:root:assets => /content/assets
However, if I leave the Maven stuff in at the top I can do this:
for mapping in root.findall('*//{http://maven.apache.org/POM/4.0.0}mappings'):
logging.info(mapping)
for prop in mapping.findall('./{http://maven.apache.org/POM/4.0.0}property'):
logging.info(prop.find('{http://maven.apache.org/POM/4.0.0}name').text + " => " + prop.find('{http://maven.apache.org/POM/4.0.0}value').text)
Which results in:
INFO:root:<Element '{http://maven.apache.org/POM/4.0.0}mappings' at 0x10aa7f310>
INFO:root:homepage => /content/homepage
INFO:root:assets => /content/assets
However, I'd love to be able to figure out how to avoid having to account for the maven stuff since it locks me into this one format.
EDIT:
Ok, I managed to get something a bit more verbose:
import xml.etree.ElementTree as xml
def getMappingsNode(node, nodeName):
if node.findall('*'):
for n in node.findall('*'):
if nodeName in n.tag:
return n
else:
return getMappingsNode(n, nodeName)
def getMappings(rootNode):
mappingsNode = getMappingsNode(rootNode, 'mappings')
mapping = {}
for prop in mappingsNode.findall('*'):
key = ''
val = ''
for child in prop.findall('*'):
if 'name' in child.tag:
key = child.text
if 'value' in child.tag:
val = child.text
if val and key:
mapping[key] = val
return mapping
pomFile = xml.parse('pom.xml')
root = pomFile.getroot()
mappings = getMappings(root)
print mappings

Accessing XMLNS attribute with Python Elementree?

How can one access NS attributes through using ElementTree?
With the following:
<data xmlns="http://www.foo.net/a" xmlns:a="http://www.foo.net/a" book="1" category="ABS" date="2009-12-22">
When I try to root.get('xmlns') I get back None, Category and Date are fine, Any help appreciated..
I think element.tag is what you're looking for. Note that your example is missing a trailing slash, so it's unbalanced and won't parse. I've added one in my example.
>>> from xml.etree import ElementTree as ET
>>> data = '''<data xmlns="http://www.foo.net/a"
... xmlns:a="http://www.foo.net/a"
... book="1" category="ABS" date="2009-12-22"/>'''
>>> element = ET.fromstring(data)
>>> element
<Element {http://www.foo.net/a}data at 1013b74d0>
>>> element.tag
'{http://www.foo.net/a}data'
>>> element.attrib
{'category': 'ABS', 'date': '2009-12-22', 'book': '1'}
If you just want to know the xmlns URI, you can split it out with a function like:
def tag_uri_and_name(elem):
if elem.tag[0] == "{":
uri, ignore, tag = elem.tag[1:].partition("}")
else:
uri = None
tag = elem.tag
return uri, tag
For much more on namespaces and qualified names in ElementTree, see effbot's examples.
Look at the effbot namespaces documentation/examples; specifically the parse_map function. It shows you how to add an *ns_map* attribute to each element which contains the prefix/URI mapping that applies to that specific element.
However, that adds the ns_map attribute to all the elements. For my needs, I found I wanted a global map of all the namespaces used to make element look up easier and not hardcoded.
Here's what I came up with:
import elementtree.ElementTree as ET
def parse_and_get_ns(file):
events = "start", "start-ns"
root = None
ns = {}
for event, elem in ET.iterparse(file, events):
if event == "start-ns":
if elem[0] in ns and ns[elem[0]] != elem[1]:
# NOTE: It is perfectly valid to have the same prefix refer
# to different URI namespaces in different parts of the
# document. This exception serves as a reminder that this
# solution is not robust. Use at your own peril.
raise KeyError("Duplicate prefix with different URI found.")
ns[elem[0]] = "{%s}" % elem[1]
elif event == "start":
if root is None:
root = elem
return ET.ElementTree(root), ns
With this you can parse an xml file and obtain a dict with the namespace mappings. So, if you have an xml file like the following ("my.xml"):
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"\
>
<feed>
<item>
<title>Foo</title>
<dc:creator>Joe McGroin</dc:creator>
<description>etc...</description>
</item>
</feed>
</rss>
You will be able to use the xml namepaces and get info for elements like dc:creator:
>>> tree, ns = parse_and_get_ns("my.xml")
>>> ns
{u'content': '{http://purl.org/rss/1.0/modules/content/}',
u'dc': '{http://purl.org/dc/elements/1.1/}'}
>>> item = tree.find("/feed/item")
>>> item.findtext(ns['dc']+"creator")
'Joe McGroin'
Try this:
import xml.etree.ElementTree as ET
import re
import sys
with open(sys.argv[1]) as f:
root = ET.fromstring(f.read())
xmlns = ''
m = re.search('{.*}', root.tag)
if m:
xmlns = m.group(0)
print(root.find(xmlns + 'the_tag_you_want').text)

Categories

Resources