Using Python ElementTree/ElementInclude and xpointer to access included XML files - python

I have a 'main.xml' file that includes two 'sub_x.xml' files. The include lines use 'xpointer' to include only specific tags from the included XML files. When I use ElementTree to check whether this worked, it shows that the whole 'sub' XML files are being included and not just the tags I want. I am not sure whether I am using xpointer incorrectly or whether ElementTree or ElementInclude does not support this. Here are the files:
------'main.xml'--------
`<?xml version='1.0' encoding='utf-8'?>
<ModelInfo xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="sub_1.xml" xpointer="xpointer(//ModelInfo/Model)" parse="xml" />
<xi:include href="sub_2.xml" xpointer="xpointer(//ModelInfo/Model)" parse="xml" />
</ModelInfo>`
-------'sub_1.xml'------
`<?xml version="1.0" ?>
<ModelInfo>
<Model ModelName="glow">
<Variables>
<Variable Alias="glow_val" Input="False" Output="True" />
</Variables>
</Model>
</ModelInfo>`
-------'sub_2.xml'------
`<?xml version='1.0' encoding='utf-8'?>
<ModelInfo>
<Model ModelName="sirpwr_b_supply8v1">
<Variables>
<Variable Alias="sirpwr_a_supplyecu_Snsr8vIstat" Input="True" Output="False" />
</Variables>
</Model>
</ModelInfo>`
I would like 'main.xml' to appear to ElementTree as:
`<?xml version='1.0' encoding='utf-8'?>
<ModelInfo xmlns:xi="http://www.w3.org/2001/XInclude">
<Model ModelName="glow">
<Variables>
<Variable Alias="glow_val" Input="False" Output="True" />
</Variables>
</Model>
<Model ModelName="sirpwr_b_supply8v1">
<Variables>
<Variable Alias="sirpwr_a_supplyecu_Snsr8vIstat" Input="True" Output="False" />
<Variable Alias="sirpwr_b_supply8v1_qstat" Input="False" Output="True" />
</Variables>
</Model>
</ModelInfo>`
The script I am running to load the XML files and test is:
`from xml.etree import ElementTree, ElementInclude

tree = ElementTree.parse('main.xml')
root = tree.getroot()
ElementInclude.include(root)  # expands the xi:include elements in place
for element in root:
    print(element.tag)`
The xpointer does not seem to be applied: the 'ModelInfo' root element is copied over from each 'sub_x' XML file, not just the 'Model' elements.

ElementInclude does not support all of XInclude. The xpointer attribute on the <include> element is ignored.
It does work the way you want it with lxml and the xinclude() method:
from lxml import etree
tree = etree.parse('main.xml')
tree.xinclude()
print(etree.tostring(tree).decode())
Note that the XPointer xpointer() scheme never reached the status of W3C Recommendation (it's still just a working draft). It has been implemented in libxml2 (the C library behind lxml) but almost nowhere else.
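If lxml is not an option, the effect of this particular xpointer can be approximated with the standard library alone. The following is a minimal sketch, not a general XPointer implementation: the helper name include_children and its child_path argument are hypothetical, and it assumes that every xi:include is a direct child of the root and that the wanted elements can be selected with an ElementTree path such as 'Model'. The demo writes trimmed copies of the question's files to a temporary directory.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

# Trimmed stand-ins for the files in the question (assumption: same layout).
FILES = {
    "main.xml": (
        '<ModelInfo xmlns:xi="http://www.w3.org/2001/XInclude">\n'
        '<xi:include href="sub_1.xml" parse="xml" />\n'
        '<xi:include href="sub_2.xml" parse="xml" />\n'
        '</ModelInfo>'
    ),
    "sub_1.xml": '<ModelInfo><Model ModelName="glow" /></ModelInfo>',
    "sub_2.xml": '<ModelInfo><Model ModelName="sirpwr_b_supply8v1" /></ModelInfo>',
}


def include_children(root, base_dir, child_path):
    """Replace each xi:include child of root with the elements that
    child_path selects from the root of the referenced file."""
    # Walk backwards so that insertions do not shift indexes still to be visited.
    for index in range(len(root) - 1, -1, -1):
        node = root[index]
        if node.tag != XI_INCLUDE:
            continue
        sub_root = ET.parse(os.path.join(base_dir, node.get("href"))).getroot()
        selected = sub_root.findall(child_path)
        root.remove(node)
        for offset, element in enumerate(selected):
            root.insert(index + offset, element)
    return root


with tempfile.TemporaryDirectory() as tmp:
    for name, text in FILES.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    root = ET.parse(os.path.join(tmp, "main.xml")).getroot()
    include_children(root, tmp, "Model")

tags = [child.tag for child in root]
names = [child.get("ModelName") for child in root]
print(tags, names)
```

The xpointer attribute is deliberately not parsed here; the selection path is passed in by the caller, which keeps the sketch small but means it only covers cases like the one in the question.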

Related

How to find an XML tag buried fairly deep, delete the tag if it is a match and save the XML as a string?

Say that I have the following XML and that I am using Python. I am using xml.etree.ElementTree.
<?xml version='1.0' encoding='UTF-8'?>
<results preview='0'>
<meta>
<fieldOrder>
<field>count</field>
</fieldOrder>
</meta>
<result offset='0'>
<field k='count'>
<value>
<text1>6</text>
<text2>7</text>
<text3>8</text>
</value>
</field>
</result>
</results>
Is there an easy way for me to go down into the XML and also delete any text2 elements?
Desired result:
<?xml version='1.0' encoding='UTF-8'?>
<results preview='0'>
<meta>
<fieldOrder>
<field>count</field>
</fieldOrder>
</meta>
<result offset='0'>
<field k='count'>
<value>
<text1>6</text>
<text3>8</text>
</value>
</field>
</result>
</results>
Your sample XML is still not well formed (the opening and closing tags of the "text" children of <value> don't match).
Assuming that's fixed (that is, each closing tag matches its opening tag), the following should work:
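import xml.etree.ElementTree as ET

doc = ET.fromstring(xml_string)  # assumption: xml_string holds the corrected XML
parent = doc.find('.//value[text2]')  # first <value> that has a <text2> child
target = parent.find('./text2')
parent.remove(target)
print(ET.tostring(doc, xml_declaration=True).decode())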
The output should be your (fixed) expected output.
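The snippet above removes only the first match. Since the question asks about deleting any text2 elements, here is a minimal sketch that removes every one of them, wherever it sits in the tree. The helper name remove_all and the shortened sample (with the mismatched tags fixed and a second <value> added) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed variant of the sample with two <text2> elements.
SAMPLE = """<results preview='0'>
  <result offset='0'>
    <field k='count'>
      <value><text1>6</text1><text2>7</text2><text3>8</text3></value>
      <value><text2>9</text2></value>
    </field>
  </result>
</results>"""


def remove_all(root, tag):
    """Remove every element named tag from under root."""
    # Snapshot the elements first so the tree is not modified while iterating it.
    for parent in list(root.iter()):
        # findall returns a list of direct children, so removing is safe here.
        for child in parent.findall(tag):
            parent.remove(child)


root = ET.fromstring(SAMPLE)
remove_all(root, "text2")
remaining = [el.tag for el in root.iter() if el.tag.startswith("text")]
print(ET.tostring(root, encoding="unicode"))
```

ElementTree elements have no parent pointer, which is why the sketch walks every potential parent and removes matching children from it.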

CDATA sections and comments are lost when parsing XML with ElementTree

I am editing XML files and ran into a problem: when a Python script changes a file, its structure is lost.
Xml file:
<?xml version="1.0" encoding="UTF-8"?>
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path><![CDATA[path]]></path>
<code_main />
</errors>
<reference>3</reference>
</element>
....
</main>
Using:
tree = ET.parse(xml_file)
tree.write("test.xml", encoding='utf-8', xml_declaration=True)
I lose all comments in the file, and when I compare the original file with the modified one using diff (on Linux), the files are reported as completely different.
Is there a way to modify the XML file (my task is to add a subelement to <element>) while leaving the rest of the file unchanged, including comments and element order?
Both the order and the comments are essential in this file.
Update:
After running the code above, the source XML comes out in the following form:
<?xml version='1.0' encoding='utf-8'?>
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path>path</path>
<code_main />
</errors>
<reference>3</reference>
</element>
</main>
Note the <path> element: its CDATA section has been replaced by plain text.
Comments are not preserved either:
Source:
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path><![CDATA[path]]></path>
<!--Stt-->
<code_main />
</errors>
<reference>3</reference>
</element>
</main>
Modified:
<main>
<element formatVersion="1.0">
<firstValue>firstText</firstValue>
<secondValue>secondText</secondValue>
<thirdValue>thirdText</thirdValue>
<errors>
<path>path</path>
<code_main />
</errors>
<reference>3</reference>
</element>
</main>
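For the comments, one possible approach with the standard library (Python 3.8 or later) is to parse with a TreeBuilder created with insert_comments=True. The sketch below uses a trimmed copy of the sample and a hypothetical <newValue> subelement. It keeps the comment, but ElementTree still writes the CDATA section back as plain text; lxml's XMLParser(strip_cdata=False) is the usual route when CDATA must survive, and that is not shown here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question.
SOURCE = (
    '<main><element formatVersion="1.0"><errors>'
    '<path><![CDATA[path]]></path><!--Stt--><code_main />'
    '</errors><reference>3</reference></element></main>'
)

# insert_comments=True keeps comments as nodes in the tree instead of dropping them.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SOURCE, parser=parser)

# Hypothetical new subelement appended to <element>.
ET.SubElement(root.find("element"), "newValue").text = "added"

output = ET.tostring(root, encoding="unicode")
print(output)
```

Whitespace and attribute quoting may still differ from the original file, so a byte-for-byte identical diff is not something this approach promises.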

Adding a new XML element using python ElementTree library

I'm trying to add a new element to an xml file using the python ElementTree library with the following code.
from xml.etree import ElementTree as et

def UpdateXML(pre):
    xml_file = 'place/file.xml'
    tree = et.parse(xml_file)
    root = tree.getroot()
    for parent in root.findall('Parent'):
        # SubElement always appends the new node as the last child
        et.SubElement(parent, "NewNode", attribute=pre)
    tree.write(xml_file)
The XML I want it to render is in the following format
<Parent>
<Child1 Attribute="Stuff"/>
<NewNode Attribute="MoreStuff"/> <--- new
<Child3>
<Child4>
<CHild5>
<Child6>
</Parent>
However the xml it actually renders is in this incorrect format
<Parent>
<Child1 Attribute="Stuff"/>
<Child3>
<Child4>
<CHild5>
<Child6>
<NewNode Attribute="MoreStuff"/> <--- new
</Parent>
What do I change in my code to render the correct xml?
You want the insert operation:
node = et.Element('NewNode')
parent.insert(1,node)
Which in my testing gets me:
<Parent>
<Child1 Attribute="Stuff" />
<NewNode /><Child3 />
<Child4 />
<CHild5 />
<Child6 />
</Parent>
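Putting the insert call into a complete, runnable form: the sketch below works on an inline string instead of the asker's file, and the <Root> wrapper and the shortened child list are assumptions made so that the example is self-contained.

```python
from xml.etree import ElementTree as et

# Assumed document shape: <Parent> elements sit directly under the root.
SOURCE = "<Root><Parent><Child1 Attribute='Stuff'/><Child3/><Child4/></Parent></Root>"

root = et.fromstring(SOURCE)
for parent in root.findall("Parent"):
    node = et.Element("NewNode", Attribute="MoreStuff")
    parent.insert(1, node)  # index 1 places the node right after Child1

order = [child.tag for child in root.find("Parent")]
print(et.tostring(root, encoding="unicode"))
```

insert takes a position among the existing children, so the index has to be adjusted if the new node should follow a child other than the first.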

How to remove all attributes of a tag

How can I remove all the attributes of an XML tag, so that I get from this:
<xml blah blah blah> to just <xml>?
With lxml I know I can remove the whole element, but I didn't find a way to do this for a specific tag. (I found solutions on Stack Overflow for C#, but I want Python.)
I am opening a gpx(xml) file and this is my code so far (based on How do I get the whole content between two xml tags in Python?):
from lxml import etree
t = etree.parse("1.gpx")
e = t.xpath('//trk')[0]
print((e.text + ''.join(etree.tostring(child, encoding='unicode') for child in e)).strip())
Another approach I did was this:
from lxml import etree
TOPOGRAFIX_NS = './/{http://www.topografix.com/GPX/1/1}'
TRACKPOINT_NS = TOPOGRAFIX_NS + 'extensions/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}TrackPointExtension/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}'
doc1 = etree.parse("1.gpx")
for node1 in doc1.findall(TOPOGRAFIX_NS + 'trk'):
    node_to_string1 = etree.tostring(node1)
    print(node_to_string1)
But I get the trk tag with the TOPOGRAFIX_NS attributes, which I don't want, so here I want to remove the tag's attributes. I just want to get:
<trk> all the inside content </trk>
Thank you very much!
P.S. The content of the gpx file:
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="Endomondo.com" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/GpxExtensions/v3 http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<metadata>
<author>
<name>Blah Blah</name>
<email id="blah" domain="blah.com"/>
</author>
<link href="http://www.endomondo.com">
<text>Endomondo</text>
</link>
<time>2014-01-20T10:50:28Z</time>
</metadata>
<trk>
<name>Galati</name>
<src>http://www.endomondo.com/</src>
<link href="http://www.endomondo.com/workouts/260782567/13005122">
<text>Galati</text>
</link>
<type>MOUNTAIN_BIKING</type>
<trkseg>
<trkpt lat="45.431074" lon="28.021038">
<time>2013-10-20T05:49:04Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>
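The "attributes" that show up on <trk> are namespace declarations, which the serializer adds because the element's tag is namespace-qualified. One way to get a bare <trk> is to strip the namespace from the tags before serializing. The following is a minimal sketch that uses the standard library's ElementTree rather than lxml and a trimmed copy of the GPX file; strip_namespaces is a hypothetical helper, and the result is no longer valid GPX because the namespace is discarded.

```python
import xml.etree.ElementTree as ET

GPX_NS = "http://www.topografix.com/GPX/1/1"

# Trimmed copy of the GPX file from the question.
GPX = (
    '<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">'
    '<trk><name>Galati</name><trkseg>'
    '<trkpt lat="45.431074" lon="28.021038">'
    '<time>2013-10-20T05:49:04Z</time></trkpt>'
    '</trkseg></trk></gpx>'
)


def strip_namespaces(element):
    """Drop the '{namespace}' prefix from the tag of element and its descendants."""
    for node in element.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(node.tag, str) and "}" in node.tag:
            node.tag = node.tag.split("}", 1)[1]
    return element


root = ET.fromstring(GPX)
trk = root.find("{%s}trk" % GPX_NS)
trk_xml = ET.tostring(strip_namespaces(trk), encoding="unicode")
print(trk_xml)
```

Ordinary attributes such as lat and lon are left alone; only the namespace part of each tag name is removed.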

Python ElementTree find() using a wildcard?

I am parsing an XML feed in python to extract certain tags. My XML contains namespaces and this results in each tag containing a namespace followed by tag name.
Here is the xml:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:rte="http://www.rte.ie/schemas/vod">
<id>10038711/</id>
<updated>2013-01-24T22:52:43+00:00</updated>
<title type="text">Reeling in the Years</title>
<logo>http://www.rte.ie/iptv/images/logo.gif</logo>
<link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist?type=iptv&showId=10038711" />
<category term="feed"/>
<author>
<name>RTE</name>
<uri>http://www.rte.ie</uri>
</author>
<entry>
<id>10038711</id>
<published>2012-07-04T12:00:00+01:00</published>
<updated>2013-01-06T12:31:25+00:00</updated>
<title type="text">Reeling in the Years</title>
<content type="text">National and international events with popular music from the year 1989.First Broadcast: 08/11/1999</content>
<category term="WEB Exclusive" rte:type="channel"/>
<category term="Classics 1980" rte:type="genre"/>
<category term="rte player" rte:type="source"/>
<category term="" rte:type="transmision_details"/>
<category term="False" rte:type="copyprotectionoptout"/>
<category term="long" rte:type="form"/>
<category term="3275" rte:type="progid"/>
<link rel="site" type="text/html" href="http://www.rte.ie/tv50/"/>
<link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist/?itemId=10038711&type=iptv&format=xml" />
<link rel="alternate" type="text/html" href="http://www.rte.ie/player/#v=10038711"/>
<rte:valid start="2012-07-23T15:56:04+01:00" end="2017-08-01T15:56:04+01:00"/>
<rte:duration ms="842205" formatted="0:10"/>
<rte:statistics views="19"/>
<rte:bri id="na"/>
<rte:channel id="13"/>
<rte:item id="10038711"/>
<media:title type="plain">Reeling in the Years</media:title>
<media:description type="plain">National and international events with popular music from the year 1989. First Broadcast: 08/11/1999</media:description>
<media:thumbnail url="http://img.rasset.ie/00062efc200.jpg" height="288" width="512" time="00:00:00+00:00"/>
<media:teaserimgref1x1 url="" time="00:00:00+00:00"/>
<media:rating scheme="http://www.rte.ie/schemes/vod">NA</media:rating>
<media:copyright>RTÉ</media:copyright>
<media:group rte:format="single">
<media:content url="http://vod.hds.rasset.ie/manifest/2012/0728/20120728_reelingint_cl10038711_10039316_260_.f4m" type="video/mp4" medium="video" expression="full" duration="842205" rte:format="content"/>
</media:group>
<rte:ads>
<media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&iu=%2F3014%2FP_RTE_TV50_Pre&ciu_szs=300x250&impl=s&gdfp_req=1&env=vp&output=xml_vast2&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
<media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&iu=%2F3014%2FP_RTE_TV50_Pre2&ciu_szs=300x250&impl=s&gdfp_req=1&env=vp&output=xml_vast2&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
<media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&iu=%2F3014%2FP_RTE_TV50_Pre3&ciu_szs=300x250&impl=s&gdfp_req=1&env=vp&output=xml_vast2&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
</rte:ads>
</entry>
<!-- playlist.xml -->
</feed>
When the XML is parsed, each element comes out as:
{http://www.w3.org/2005/Atom}id
{http://www.w3.org/2005/Atom}published
{http://www.w3.org/2005/Atom}updated
.....
.....
{http://www.rte.ie/schemas/vod}valid
{http://www.rte.ie/schemas/vod}duration
....
....
{http://search.yahoo.com/mrss/}description
{http://search.yahoo.com/mrss/}thumbnail
....
As I have 3 different namespaces and cannot guarantee that they will always stay the same, I would prefer not to hardcode each tag like so:
for elem in tree.iter('{http://www.w3.org/2005/Atom}entry'):
    stream = str(elem.find('{http://www.w3.org/2005/Atom}id').text)
    date_tmp = str(elem.find('{http://www.w3.org/2005/Atom}published').text)
    name_tmp = str(elem.find('{http://www.w3.org/2005/Atom}title').text)
    short_tmp = str(elem.find('{http://www.w3.org/2005/Atom}content').text)
    channel_tmp = elem.find('{http://www.w3.org/2005/Atom}category', "channel")
    channel = str(channel_tmp.get('term'))
    icon_tmp = elem.find('{http://search.yahoo.com/mrss/}thumbnail')
    icon_url = str(icon_tmp.get('url'))
Is there any way that I can put a wildcard or something similar into the find so it will simply ignore the namespace?
stream = str(elem.find('*id').text)
I can hardcode them as above, but it would be just my luck that the namespace changes down the line and my queries stop returning data.
Thanks for the help.
You can use an XPath expression with the local-name() function:
<?xml version="1.0"?>
<root xmlns="ns">
<tag/>
</root>
Assuming "doc" is the ElementTree for the above XML:
import lxml.etree
doc = lxml.etree.parse(<some_file_like_object>)
root = doc.getroot()
root.xpath('//*[local-name()="tag"]')
[<Element {ns}tag at 0x7fcde6f7c960>]
replacing <some_file_like_object> as appropriate (alternatively, you can use lxml.etree.fromstring with an XML string to get the root element directly).
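With the standard library, Python 3.8 and later also accept a {*} namespace wildcard in find(), findall() and findtext(), which avoids the dependency on lxml. A minimal sketch on a cut-down version of the feed from the question (only three of the child elements are kept):

```python
import xml.etree.ElementTree as ET

# Cut-down version of the Atom feed from the question.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
  <entry>
    <id>10038711</id>
    <title type="text">Reeling in the Years</title>
    <media:thumbnail url="http://img.rasset.ie/00062efc200.jpg"/>
  </entry>
</feed>"""

root = ET.fromstring(FEED)
entries = []
# "{*}name" matches elements called name in any namespace, or in none.
for entry in root.findall(".//{*}entry"):
    stream = entry.findtext("{*}id")
    name = entry.findtext("{*}title")
    icon_url = entry.find("{*}thumbnail").get("url")
    entries.append((stream, name, icon_url))

print(entries)
```

Because the wildcard ignores the namespace entirely, two elements with the same local name in different namespaces (for example an Atom title and a media:title in the full feed) would both match, so the first hit from find() may not be the one intended.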
