lxml non-recursive full tag


Given the following xml:
<node a='1' b='1'>
<subnode x='25'/>
</node>
I would like to extract the tagname and all attributes for the first node, i.e., the verbatim code:
<node a='1' b='1'>
without the subnode.
For example in Python, tostring returns too much:
from lxml import etree
root = etree.fromstring("<node a='1' b='1'><subnode x='25'>some text</subnode></node>")
print(etree.tostring(root))
returns
b'<node a="1" b="1"><subnode x="25">some text</subnode></node>'
The following gives the desired result, but is much too verbose:
tag = root.tag
for att, val in root.attrib.items():
    tag += ' '+att+'="'+val+'"'
tag = '<'+tag+'>'
print(tag)
result:
<node a="1" b="1">
What is an easier (and guaranteed attribute order preserving) way of doing this?

You can remove all of the subnodes.
from lxml import etree
root = etree.fromstring("<node a='1' b='1'><subnode x='25'>some text</subnode></node>")
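for subnode in root.xpath("//subnode"):
    subnode.getparent().remove(subnode)
etree.tostring(root) # b'<node a="1" b="1"/>'
Note that this modifies the tree and yields a self-closing tag. Alternatively, you can apply a simple regex to the serialized, unmodified element. Order is guaranteed.
import re
# encoding='unicode' makes tostring return str instead of bytes
res = re.search(r'<(.*?)>', etree.tostring(root, encoding='unicode'))
res.group(1) # 'node a="1" b="1"'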
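A third option is to build the opening tag from the element's tag and attrib without serializing or modifying the tree. The sketch below is only an illustration: start_tag is a hypothetical helper name, it uses the standard library's xml.etree.ElementTree so that it runs without lxml (lxml elements expose the same tag and attrib interface), and it assumes un-namespaced tag and attribute names. It relies on attrib keeping document order, which holds on current Python 3 versions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def start_tag(elem):
    """Return only the opening tag of elem (hypothetical helper)."""
    # quoteattr escapes &, < and > and adds the surrounding quotes
    attrs = "".join(
        " %s=%s" % (name, quoteattr(value))
        for name, value in elem.attrib.items()
    )
    return "<%s%s>" % (elem.tag, attrs)


root = ET.fromstring("<node a='1' b='1'><subnode x='25'>some text</subnode></node>")
opening = start_tag(root)
print(opening)  # <node a="1" b="1">
```

Unlike the removal approach, this leaves the subnode in place. For namespaced elements the tag would come out in {uri}name form, so that case needs extra handling.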

Related

Extracting Child XML using ElementTree ignoring Namespace

I have the following XML, from which I would like to extract the child element whose name attribute matches "Adam":
<data>
<a:config version="1.0" xmlns:a="uri:abc.com/a" xmlns:b="uri:abc.com/b">
<a:xxx config="ABC">
<set>option_on</set>
<location>/123/123</location>
<data>123</data>
</a:xxx>
<a:xxx name="Adam">
<a:yyy value="5555-5555">
<log>true</log>
</a:yyy>
</a:xxx>
<a:xxx name="Lisa">
<a:yyy value="2222-2222">
<log>false</log>
</a:yyy>
</a:xxx>
</a:config>
</data>
I managed to extract the section, but the output does not use the original namespace prefix; it shows ns0 and ns1 instead. Below is my code:
import xml.etree.ElementTree as ET
tree2 = ET.parse("mycode.xml")
root2= tree2.getroot()
for elem in tree2.iter(tag='{uri:abc.com/a}xxx'):
    match = elem.get('name')
    if match == "Adam":
        bla = ET.dump(elem)
Output as follows:
<ns0:xxx xmlns:ns0="uri:abc.com/a" name="Adam">
<ns0:yyy value="5555-5555">
<log>true</log>
</ns0:yyy>
</ns0:xxx>
I am hoping to get exactly what the original document contains:
<a:xxx name="Adam">
<a:yyy value="5555-5555">
<log>true</log>
</a:yyy>
</a:xxx>

Use the register_namespace function.
import xml.etree.ElementTree as ET
tree2 = ET.parse("mycode.xml")
root2 = tree2.getroot()
# Register the 'a' prefix to be used when serializing
ET.register_namespace("a", "uri:abc.com/a")
for elem in tree2.iter(tag='{uri:abc.com/a}xxx'):
    match = elem.get('name')
    if match == "Adam":
        ET.dump(elem)  # dump() writes to stdout and returns None
Output:
<a:xxx xmlns:a="uri:abc.com/a" name="Adam">
<a:yyy value="5555-5555">
<log>true</log>
</a:yyy>
</a:xxx>
This is not the exact output that you asked for. You cannot force ElementTree to omit the namespace declaration, because without it the a: prefix would be undeclared and the extracted fragment would be ill-formed.
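To capture the fragment as a string instead of printing it, ET.tostring can be used in place of ET.dump. The following is a minimal sketch with the document inlined and shortened; extract_by_name is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

XML = """<data>
  <a:config version="1.0" xmlns:a="uri:abc.com/a" xmlns:b="uri:abc.com/b">
    <a:xxx name="Adam"><a:yyy value="5555-5555"><log>true</log></a:yyy></a:xxx>
    <a:xxx name="Lisa"><a:yyy value="2222-2222"><log>false</log></a:yyy></a:xxx>
  </a:config>
</data>"""

# Registration is global and only affects serialization, not parsing.
ET.register_namespace("a", "uri:abc.com/a")


def extract_by_name(xml_text, name):
    """Return the serialized a:xxx element with the given name, or None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("{uri:abc.com/a}xxx"):
        if elem.get("name") == name:
            return ET.tostring(elem, encoding="unicode")
    return None


fragment = extract_by_name(XML, "Adam")
print(fragment)
```

The returned string still starts with the xmlns:a declaration, for the reason given above.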

Python parse standalone-full.xml from Wildfly

I'm trying to parse the standalone-full.xml from Wildfly 8.1 Final with Python to extract some information, such as the datasources.
The example XML is below.
<?xml version="1.0" ?>
<server xmlns="urn:jboss:domain:2.1">
<profile>
<subsystem xmlns="urn:jboss:domain:datasources:2.0">
<datasources>
<datasource jndi-name="java:jboss/datasources/JNDI" pool-name="JNDI" enabled="true">
<connection-url>jdbc:oracle:thin:@//HOST</connection-url>
<driver>ojdbc6</driver>
<pool>
<min-pool-size>50</min-pool-size>
<max-pool-size>100</max-pool-size>
</pool>
<security>
<user-name>USER</user-name>
<password>USER</password>
</security>
<validation>
<valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.oracle.OracleValidConnectionChecker"/>
<validate-on-match>false</validate-on-match>
<background-validation>true</background-validation>
<background-validation-millis>10000</background-validation-millis>
<exception-sorter class-name="org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter"/>
</validation>
</datasource>
<drivers>
<driver name="h2" module="com.h2database.h2">
<xa-datasource-class>org.h2.jdbcx.JdbcDataSource</xa-datasource-class>
</driver>
<driver name="ojdbc6" module="oracle.ojdbc">
<xa-datasource-class>oracle.ojdbc.xa.client.OracleXADataSource</xa-datasource-class>
</driver>
</drivers>
</datasources>
</subsystem>
</profile>
</server>
EDIT: How can I get deeper in the tree?
I tried something like this:
In[16]: from lxml import etree
In[18]: xml = etree.parse('standalone-full.xml')
In[21]: root = xml.getroot()
In[28]: children = root[0].getchildren()
In[31]: children[0]
Out[31]: <Element {urn:jboss:domain:datasources:2.0}subsystem at 0x4bef208>
In[32]: datasources = children[0]
In[33]: datasources.getchildren()
Out[33]: [<Element {urn:jboss:domain:datasources:2.0}datasources at 0x4befa48>]

Your question is rather unspecific, but as far as I can see from the regex you posted, you want to grab the text values of the connection-url, user-name, and password nodes under each datasource node that has a pool-name attribute with a value of JNDI. Here is one possibility of doing that (tested under Python 2.7):
import xml.etree.ElementTree as ET
ns = {'ds': 'urn:jboss:domain:datasources:2.0'}
root = ET.parse('standalone-full.xml').getroot()
# @pool-name='JNDI' is an attribute predicate
children = root.findall(".//ds:datasource[@pool-name='JNDI']", ns)
for child in children:
    print(child.find("ds:connection-url", ns).text)
    security = child.find("ds:security", ns)
    print(security.find("ds:user-name", ns).text)
    print(security.find("ds:password", ns).text)
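The same lookup can be wrapped in a function that returns the values instead of printing them. This is a sketch under assumptions: the XML is a trimmed, inlined copy of the example above, and datasource_info is a hypothetical helper name. It uses findtext with a nested path so the intermediate security element does not need its own variable.

```python
import xml.etree.ElementTree as ET

XML = """<server xmlns="urn:jboss:domain:2.1">
  <profile>
    <subsystem xmlns="urn:jboss:domain:datasources:2.0">
      <datasources>
        <datasource jndi-name="java:jboss/datasources/JNDI" pool-name="JNDI" enabled="true">
          <connection-url>jdbc:oracle:thin:@//HOST</connection-url>
          <security>
            <user-name>USER</user-name>
            <password>USER</password>
          </security>
        </datasource>
      </datasources>
    </subsystem>
  </profile>
</server>"""

# The nested default namespace applies to datasource and everything below it.
NS = {"ds": "urn:jboss:domain:datasources:2.0"}


def datasource_info(xml_text, pool_name):
    """Collect URL and credentials of every datasource with this pool-name."""
    root = ET.fromstring(xml_text)
    rows = []
    for ds in root.findall(".//ds:datasource[@pool-name='%s']" % pool_name, NS):
        rows.append({
            "url": ds.findtext("ds:connection-url", namespaces=NS),
            "user": ds.findtext("ds:security/ds:user-name", namespaces=NS),
            "password": ds.findtext("ds:security/ds:password", namespaces=NS),
        })
    return rows


rows = datasource_info(XML, "JNDI")
print(rows)
```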

You could use Augeas to parse it:
$ augtool -At "Xml.lns incl $PWD/standalone-full.xml"
augtool> get //standalone-full.xml//datasource//password/#text
//standalone-full.xml//datasource//password/#text = USER
Just use the python-augeas bindings with Python:
import augeas
a = augeas.Augeas(flags=augeas.Augeas.NO_MODL_AUTOLOAD)
a.transform("Xml", "/home/raphink/bas/augeas/standalone-full.xml")
a.load()
v = a.get("//standalone-full.xml//datasource//password/#text")

I've solved my problem with a regex, which is a bad idea, but it works.
import re
# read the file contents; the regex must run on the XML text, not on the file name
with open("standalone-full.xml") as f:
    data = f.read()
regex_result = re.findall(r'.*:domain:datasources[\S\s]*?pool-name="JNDI"[\S\s]*?connection-url>.*' +
    r'@//(.*)<.*[\S\s]*?user-name>(.*)<.*\s*<password>(.*)<', data, re.M)

Removing parent element and all subelements from XML

Given an XML file with the following structure:
<Root>
<Stuff></Stuff>
<MoreStuff></MoreStuff>
<Targets>
<Target>
<ID>12345</ID>
<Type>Ground</Type>
<Size>Large</Size>
</Target>
<Target>
...
</Target>
</Targets>
</Root>
I'm trying to loop through each child under the <Targets> element, check each <ID> for a specific value, and if the value is found, then I want to delete the entire <Target> entry. I've been using the ElementTree Python library with little success. Here's what I have so far:
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
iterator = root.iter('Target')
for item in iterator:
    old = item.find('ID')
    text = old.text
    if '12345' in text:
        item.remove(old)
tree.write('out.xml')
The problem I'm having with this approach is that only the <ID> sub-element is removed, whereas I need the entire <Target> element and all of its child elements removed. Can anyone help? Thanks.

Unfortunately, ElementTree elements don't know who their parents are. There is a workaround: you can build the mapping yourself:
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
parent_map = dict((c, p) for p in tree.iter() for c in p)
# list so that we don't mess up the order of iteration when removing items.
iterator = list(root.iter('Target'))
for item in iterator:
    old = item.find('ID')
    # skip <Target> entries that have no <ID> child
    if old is not None and '12345' in (old.text or ''):
        parent_map[item].remove(item)
tree.write('out.xml')
Untested
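The parent-map idea can be checked on a small in-memory document. The sketch below is an illustration under assumptions: the second Target (ID 67890, Type Air) is made-up sample data, remove_targets is a hypothetical helper name, and it matches the ID exactly rather than by substring as in the code above.

```python
import xml.etree.ElementTree as ET

XML = """<Root>
  <Targets>
    <Target><ID>12345</ID><Type>Ground</Type></Target>
    <Target><ID>67890</ID><Type>Air</Type></Target>
  </Targets>
</Root>"""


def remove_targets(root, wanted_id):
    """Remove every Target whose ID text equals wanted_id."""
    # Map each child element to its parent, since elements have no parent link.
    parent_map = {child: parent for parent in root.iter() for child in parent}
    # Materialize the list first so removal does not disturb the iteration.
    for target in list(root.iter("Target")):
        if target.findtext("ID") == wanted_id:
            parent_map[target].remove(target)
    return root


root = remove_targets(ET.fromstring(XML), "12345")
remaining = [t.findtext("ID") for t in root.iter("Target")]
print(remaining)  # ['67890']
```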

You need to keep a reference to the Targets element so that you can remove its children, so start your iteration from there. Grab each Target, check your condition and remove what you don't like.
#!/usr/bin/env python
import xml.etree.ElementTree as ET
xmlstr="""<Root>
<Stuff></Stuff>
<MoreStuff></MoreStuff>
<Targets>
<Target>
<ID>12345</ID>
<Type>Ground</Type>
<Size>Large</Size>
</Target>
<Target>
...
</Target>
</Targets>
</Root>"""
root = ET.fromstring(xmlstr)
targets = root.find('Targets')
for target in targets.findall('Target'):
    _id = target.find('ID')
    if _id is not None and '12345' in _id.text:
        targets.remove(target)
print(ET.tostring(root).decode())

Parsing XML with ElementTree in Python

I have XML like this:
<parameter>
<name>ec_num</name>
<value>none</value>
<units/>
<url/>
<id>2455</id>
<m_date>2008-11-29 13:15:14</m_date>
<user_id>24</user_id>
<user_name>registry</user_name>
</parameter>
<parameter>
<name>swisspro</name>
<value>Q8H6N2</value>
<units/>
I want to parse the XML and extract the <value> entry which is just below the <name> entry marked 'swisspro'. I.e. I want to parse and extract the 'Q8H6N2' value.
How would I do this using ElementTree?

It would be much easier to do via lxml, but here's a solution using the ElementTree library:
import xml.etree.ElementTree as ET
data = """<parameters>
<parameter>
<name>ec_num</name>
<value>none</value>
<units/>
<url/>
<id>2455</id>
<m_date>2008-11-29 13:15:14</m_date>
<user_id>24</user_id>
<user_name>registry</user_name>
</parameter>
<parameter>
<name>swisspro</name>
<value>Q8H6N2</value>
<units/>
</parameter>
</parameters>"""
tree = ET.fromstring(data)
for parameter in tree.iter(tag='parameter'):
    name = parameter.find('name')
    if name is not None and name.text == 'swisspro':
        print(parameter.find('value').text)
        break
prints:
Q8H6N2
The idea is pretty simple: iterate over all parameter tags, check the value of the name tag and if it is equal to swisspro, get the value element.
Hope that helps.
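ElementTree's limited XPath support also accepts a child-text predicate of the form [tag='text'], so the loop can be collapsed into a single path expression. A minimal sketch, assuming the same parameters wrapper element as above and a shortened copy of the data:

```python
import xml.etree.ElementTree as ET

DATA = """<parameters>
  <parameter><name>ec_num</name><value>none</value></parameter>
  <parameter><name>swisspro</name><value>Q8H6N2</value></parameter>
</parameters>"""

root = ET.fromstring(DATA)
# Select the parameter whose name child has the text 'swisspro', then its value.
value = root.findtext("parameter[name='swisspro']/value")
print(value)  # Q8H6N2
```

findtext returns None when nothing matches, so a missing entry does not raise.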

Here is an example:
xml file
<?xml version="1.0" encoding="utf-8"?>
<root>
  <person age="18">
    <name>hzj</name>
    <sex>man</sex>
  </person>
  <person age="19" des="hello">
    <name>kiki</name>
    <sex>female</sex>
  </person>
</root>
parse method
from xml.etree import ElementTree
def print_node(node):
    '''print basic info'''
    print("==============================================")
    print("node.attrib:%s" % node.attrib)
    if "age" in node.attrib:
        print("node.attrib['age']:%s" % node.attrib['age'])
    print("node.tag:%s" % node.tag)
    print("node.text:%s" % node.text)
def read_xml(text):
    '''read xml file'''
    # root = ElementTree.parse(r"D:/test.xml").getroot() #first method
    root = ElementTree.fromstring(text) #second method
    # get element
    # 1 by iter (getiterator was removed in Python 3.9)
    lst_node = list(root.iter("person"))
    for node in lst_node:
        print_node(node)
    # 2 by indexing (getchildren was removed in Python 3.9)
    lst_node_child = lst_node[0][0]
    print_node(lst_node_child)
    # 3 by .find
    node_find = root.find('person')
    print_node(node_find)
    # 4 by findall
    node_findall = root.findall("person/name")[1]
    print_node(node_findall)
if __name__ == '__main__':
    # read bytes so the parser can honor the encoding declaration
    with open("test.xml", "rb") as f:
        read_xml(f.read())

Accessing XMLNS attribute with Python ElementTree?

How can one access namespace (xmlns) attributes using ElementTree?
With the following:
<data xmlns="http://www.foo.net/a" xmlns:a="http://www.foo.net/a" book="1" category="ABS" date="2009-12-22">
When I call root.get('xmlns') I get back None, although category and date work fine. Any help appreciated.

I think element.tag is what you're looking for. Note that your example is missing a trailing slash, so it's unbalanced and won't parse. I've added one in my example.
>>> from xml.etree import ElementTree as ET
>>> data = '''<data xmlns="http://www.foo.net/a"
... xmlns:a="http://www.foo.net/a"
... book="1" category="ABS" date="2009-12-22"/>'''
>>> element = ET.fromstring(data)
>>> element
<Element {http://www.foo.net/a}data at 1013b74d0>
>>> element.tag
'{http://www.foo.net/a}data'
>>> element.attrib
{'category': 'ABS', 'date': '2009-12-22', 'book': '1'}
If you just want to know the xmlns URI, you can split it out with a function like:
def tag_uri_and_name(elem):
    if elem.tag[0] == "{":
        uri, ignore, tag = elem.tag[1:].partition("}")
    else:
        uri = None
        tag = elem.tag
    return uri, tag
For much more on namespaces and qualified names in ElementTree, see effbot's examples.

Look at the effbot namespaces documentation/examples, specifically the parse_map function. It shows you how to add an ns_map attribute to each element, containing the prefix/URI mapping that applies to that specific element.
However, that adds the ns_map attribute to all the elements. For my needs, I wanted a global map of all the namespaces used, to make element lookup easier and avoid hardcoding.
Here's what I came up with:
import xml.etree.ElementTree as ET
def parse_and_get_ns(file):
    events = "start", "start-ns"
    root = None
    ns = {}
    for event, elem in ET.iterparse(file, events):
        if event == "start-ns":
            # for start-ns events, elem is a (prefix, uri) tuple
            if elem[0] in ns and ns[elem[0]] != "{%s}" % elem[1]:
                # NOTE: It is perfectly valid to have the same prefix refer
                # to different URI namespaces in different parts of the
                # document. This exception serves as a reminder that this
                # solution is not robust. Use at your own peril.
                raise KeyError("Duplicate prefix with different URI found.")
            ns[elem[0]] = "{%s}" % elem[1]
        elif event == "start":
            if root is None:
                root = elem
    return ET.ElementTree(root), ns
With this you can parse an xml file and obtain a dict with the namespace mappings. So, if you have an xml file like the following ("my.xml"):
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<feed>
<item>
<title>Foo</title>
<dc:creator>Joe McGroin</dc:creator>
<description>etc...</description>
</item>
</feed>
</rss>
You will be able to use the XML namespaces and get info for elements like dc:creator:
>>> tree, ns = parse_and_get_ns("my.xml")
>>> ns
{u'content': '{http://purl.org/rss/1.0/modules/content/}',
u'dc': '{http://purl.org/dc/elements/1.1/}'}
>>> item = tree.find("feed/item")
>>> item.findtext(ns['dc']+"creator")
'Joe McGroin'
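If only the prefix-to-URI map is needed, the collected dictionary can also be passed directly as the namespaces argument of find and findtext, which avoids concatenating the {uri} string by hand. The sketch below is an illustration under assumptions: collect_namespaces is a hypothetical helper name, the feed is an inlined copy of the example, and on a repeated prefix the first declaration wins instead of raising.

```python
import io
import xml.etree.ElementTree as ET

RSS = b"""<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <feed>
    <item>
      <title>Foo</title>
      <dc:creator>Joe McGroin</dc:creator>
    </item>
  </feed>
</rss>"""


def collect_namespaces(source):
    """Map each declared prefix to its URI (first declaration wins)."""
    ns = {}
    # With only "start-ns" requested, each event carries a (prefix, uri) tuple.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        ns.setdefault(prefix, uri)
    return ns


ns = collect_namespaces(io.BytesIO(RSS))
root = ET.fromstring(RSS)
creator = root.findtext("feed/item/dc:creator", namespaces=ns)
print(creator)  # Joe McGroin
```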

Try this:
import xml.etree.ElementTree as ET
import re
import sys
with open(sys.argv[1]) as f:
    root = ET.fromstring(f.read())
xmlns = ''
m = re.search('{.*}', root.tag)
if m:
    xmlns = m.group(0)
print(root.find(xmlns + 'the_tag_you_want').text)
