lxml (or lxml.html): print tree structure - python

I'd like to print out the tree structure of an etree (built from an HTML document) in a distinguishing way, meaning that two different etrees should print out differently.
What I mean by structure is the "shape" of the tree, which basically means all the tags but no attribute and no text content.
Any idea? Is there something in lxml to do that?
If not, I guess I have to iterate through the whole tree and construct a string from that. Any idea how to represent the tree in a compact way? (the "compact" feature is less relevant)
FYI, it is not intended to be looked at, but to be stored and hashed so that I can tell several HTML templates apart.
Thanks

Maybe just run some XSLT over the source XML to strip everything but the tags. It's then easy enough to use etree.tostring to get a string you could hash...
from lxml import etree as ET

def pp(e):
    # tostring() returns bytes on Python 3, so decode before printing
    print(ET.tostring(e, pretty_print=True).decode())
    print()
root = ET.XML("""\
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8" />
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
<boolean id="import_live">0</boolean>
</preference-set>
</project>
""")
pp(root)
xslt = ET.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
""")
tr = ET.XSLT(xslt)
doc2 = tr(root)
root2 = doc2.getroot()
pp(root2)
Gives you the output:
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8"/>
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
<boolean id="import_live">0</boolean>
</preference-set>
</project>
<project>
<livefolder/>
<livefolder/>
<preference-set>
<boolean/>
</preference-set>
</project>
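
If XSLT feels heavy for this, the manual walk mentioned in the question is also short. The sketch below is one possible approach, not a built-in lxml feature: `shape()` and `shape_hash()` are hypothetical helper names, and the demo uses the standard library's `xml.etree.ElementTree` with `hashlib` so that it runs without extra dependencies. lxml elements support the same child iteration, so the helpers should work on them as well.

```python
import hashlib
import xml.etree.ElementTree as ET


def shape(elem):
    """Return a compact string for the tag structure under elem.

    Attributes, text and tails are ignored; only tags and nesting count.
    """
    # Comments and processing instructions have a non-string .tag; skip them.
    children = [c for c in elem if isinstance(c.tag, str)]
    return elem.tag + "(" + ",".join(shape(c) for c in children) + ")"


def shape_hash(elem):
    """Hash the shape string so templates can be compared cheaply."""
    return hashlib.sha1(shape(elem).encode("utf-8")).hexdigest()


# Same structure, different attributes and text.
a = ET.fromstring('<div id="x"><p>one</p><p class="c">two</p><ul><li/></ul></div>')
b = ET.fromstring('<div><p>other text</p><p/><ul><li>item</li></ul></div>')
# Different structure.
c = ET.fromstring('<div><p/><ul><li/><li/></ul></div>')

print(shape(a))                          # div(p(),p(),ul(li()))
print(shape_hash(a) == shape_hash(b))    # True
print(shape_hash(a) == shape_hash(c))    # False
```

The recursion is fine for typical HTML nesting depths; an extremely deep tree would need an iterative walk instead.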

Related

Parsing an xml file using lxml

I'm trying to edit an xml file by finding each Watts tag and changing the text in it. So far I've managed to change all tags, but not the Watts tag specifically.
My parser is:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "Watts":
        watt.text = "strong"
tree.write("output.xml")
This keeps my cycling.xml file unchanged. A snippet from output.xml (which is also the cycling.xml file since this is unchanged) is:
<TrainingCenterDatabase xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
<Activities>
<Activity Sport="Biking">
<Id>2018-05-06T20:49:56Z</Id>
<Lap StartTime="2018-05-06T20:49:56Z">
<TotalTimeSeconds>2495.363</TotalTimeSeconds>
<DistanceMeters>15345</DistanceMeters>
<MaximumSpeed>18.4</MaximumSpeed>
<Calories>0</Calories>
<Intensity>Active</Intensity>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2018-05-06T20:49:56Z</Time>
<Position>
<LatitudeDegrees>49.319297</LatitudeDegrees>
<LongitudeDegrees>-123.024128</LongitudeDegrees>
</Position>
<HeartRateBpm>
<Value>99</Value>
</HeartRateBpm>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>0</Watts>
<Speed>2</Speed>
</TPX>
</Extensions>
</Trackpoint>
If I change my parser to change all tags with:
for watt in root.iter():
    if watt.tag != "Watts":
        watt.text = "strong"
Then my output.xml file becomes:
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">strong<Activities>strong<Activity Sport="Biking">strong<Id>strong</Id>
<Lap StartTime="2018-05-06T20:49:56Z">strong<TotalTimeSeconds>strong</TotalTimeSeconds>
<DistanceMeters>strong</DistanceMeters>
<MaximumSpeed>strong</MaximumSpeed>
<Calories>strong</Calories>
<Intensity>strong</Intensity>
<TriggerMethod>strong</TriggerMethod>
<Track>strong<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<AltitudeMeters>strong</AltitudeMeters>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
How can I change just the Watts tag?
I don't understand what the root = tree.getroot() does. I just thought I'd ask this question at the same time, although I'm not sure it matters in my particular problem.
Your document defines a default XML namespace. Look at the xmlns= attribute at the end of the opening tag:
<TrainingCenterDatabase
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
This means there is no element named simply "Watts" in your document; you need to qualify tag names with the appropriate namespace. If you print out the value of watt.tag in your loop, you will see:
$ python filter.py
{http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}TrainingCenterDatabase
[...]
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Speed
With this in mind, you can modify your filter so that it looks like this:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts":
        watt.text = "strong"
tree.write("output.xml")
You can read more about namespace handling in the lxml documentation.
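
As a small variation on the loop above, `iter()` also accepts a qualified tag name, so the `if` test can be dropped. This is only a sketch: `SAMPLE` is a trimmed-down stand-in for cycling.xml, and it uses the standard library's `xml.etree.ElementTree` so it runs anywhere; lxml's `iter()` takes a tag argument in the same way.

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed-down stand-in for cycling.xml from the question.
SAMPLE = """\
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Trackpoint>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>0</Watts>
        <Speed>2</Speed>
      </TPX>
    </Extensions>
  </Trackpoint>
</TrainingCenterDatabase>
"""

EXT_NS = "http://www.garmin.com/xmlschemas/ActivityExtension/v2"
WATTS = "{%s}Watts" % EXT_NS

root = ET.fromstring(SAMPLE)

# iter() with a qualified name visits only the Watts elements.
for watt in root.iter(WATTS):
    watt.text = "strong"

watts_texts = [w.text for w in root.iter(WATTS)]
speed_texts = [s.text for s in root.iter("{%s}Speed" % EXT_NS)]
print(watts_texts, speed_texts)
```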
Alternatively, since your goal is to edit XML and you are already using lxml, consider XSLT (the XML transformation language), where you can define a namespace prefix and change Watts anywhere in the document without looping. Plus, you can pass values into XSLT from Python!
XSLT (save as .xsl file)
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- VALUE TO BE PASSED INTO FROM PYTHON -->
<xsl:param name="python_value"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- ADJUST WATTS TEXT -->
<xsl:template match="doc:Watts">
<xsl:copy><xsl:value-of select="$python_value"/></xsl:copy>
</xsl:template>
</xsl:transform>
Python
from lxml import etree
# LOAD XML AND XSL
doc = etree.parse("cycling.xml")
xsl = etree.parse('XSLT_Script.xsl')
# CONFIGURE TRANSFORMER
transform = etree.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = etree.XSLT.strparam('Strong')
result = transform(doc, python_value=n)
# PRINT TO CONSOLE
print(result)
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Pull out sections from XML in Python

Please note that I have some Python experience but not a lot of deep experience so please bear with me.
I have a very large XML file, ~100 megs, that has many, many sections and subsections. I need to pull out each subsection of a certain type (and there are a lot with this type) and write each to a different file. The writing I can handle, but I'm staring at ElementTree documentation trying to make sense of how to traverse the tree, find an element declared this way, yank out just the data between those tags and process it, then continue down the file.
The structure is similar to this (slightly obfuscated). What I want to do is pull out each section labeled "data" individually.
<filename>
<config>
<collections>
<datas>
<data>
...
</data>
<data>
...
</data>
<data>
...
</data>
</datas>
</collections>
</config>
</filename>
I think you can read in each data element using iterparse and then write it out. The following simply prints the element using the print function, but you could of course write it to a file instead:
import xml.etree.ElementTree as ET
for event, elem in ET.iterparse("input.xml"):
    if elem.tag == 'data':
        print(ET.tostring(elem, 'UTF-8', 'xml'))
        elem.clear()
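
To go from printing to one file per element, something along these lines might work. It is a sketch under assumptions: `SAMPLE`, the temporary directory, and the `data_N.xml` file names are made up for the demonstration, and the `<v>` children stand in for the elided content.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Assumption: a tiny stand-in for the ~100 MB input file.
SAMPLE = """\
<filename><config><collections><datas>
<data><v>1</v></data>
<data><v>2</v></data>
<data><v>3</v></data>
</datas></collections></config></filename>
"""

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "input.xml")
with open(source, "w", encoding="utf-8") as f:
    f.write(SAMPLE)

written = []
# The default "end" event fires once an element has been read completely.
for event, elem in ET.iterparse(source):
    if elem.tag == "data":
        target = os.path.join(workdir, "data_%d.xml" % (len(written) + 1))
        # Wrapping the element in an ElementTree serializes just this subtree.
        ET.ElementTree(elem).write(target, encoding="utf-8", xml_declaration=True)
        written.append(target)
        elem.clear()  # drop the children that are no longer needed

print(written)
```

Clearing each element keeps its content from accumulating, although the emptied `<data>` elements themselves stay attached to their parent until parsing finishes.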
Consider an XSLT solution with Python's third-party module, lxml. Specifically, you count the <data> nodes with xpath() and then, in a loop, build a dynamic XSLT script that selects only the needed element by node index [#] and writes each one to its own XML file:
import lxml.etree as et
dom = et.parse('Input.xml')
datalen = len(dom.xpath("//data"))
for i in range(1, datalen+1):
    xsltstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
    <xsl:strip-space elements="*"/>
    <xsl:template match="datas">
    <xsl:apply-templates select="data[{0}]" />
    </xsl:template>
    <xsl:template match="data[{0}]">
    <xsl:copy>
    <xsl:copy-of select="*"/>
    </xsl:copy>
    </xsl:template>
    </xsl:transform>'''.format(i)
    xslt = et.fromstring(xsltstr)
    transform = et.XSLT(xslt)
    newdom = transform(dom)
    tree_out = et.tostring(newdom, encoding='UTF-8', pretty_print=True,
                           xml_declaration=True)
    # one output file per <data> element: Data1.xml, Data2.xml, ...
    with open('Data{}.xml'.format(i), 'wb') as xmlfile:
        xmlfile.write(tree_out)

Python etree XSLT Requires Tag output?

I'm trying to make a simple XML --> CSV script, using XSLT. I found that etree seems to "want" a tag to output... Does anyone know a workaround? Yes, I've seen this post: XML to CSV Using XSLT.
See below...
Here's a sample XML data just for reference. My code doesn't even do anything with the data yet, as it was failing to even write a header.
<projects>
<project>
<name>Shockwave</name>
<language>Ruby</language>
<owner>Brian May</owner>
<state>New</state>
<startDate>31/10/2008 0:00:00</startDate>
</project>
<project>
<name>Other</name>
<language>Erlang</language>
<owner>Takashi Miike</owner>
<state> Canceled </state>
<startDate>07/11/2008 0:00:00</startDate>
</project>
</projects>
Here's my script:
import sys
from lxml import etree
system_file = sys.argv[1]
xml_file = sys.argv[2]
sys_txt = open( system_file,"r" ).read()
xsl_txt = open( "csv_file.xslt","r" ).read()
sysroot = etree.fromstring( sys_txt )
xslroot = etree.fromstring( xsl_txt )
transform = etree.XSLT( xslroot )
with open( xml_file, "w" ) as f:
    f.write(etree.tostring( transform(sysroot) ) )
This XSLT code does NOT work ( etree.tostring... = None ):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
Hi
</xsl:template>
</xsl:stylesheet>
But THIS XSLT does work... seems etree needs to output an XML file?
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<dummy>
Hi
</dummy>
</xsl:template>
</xsl:stylesheet>
At this point I'm thinking I can proceed with a dummy tag, then remove it at end...
"Python etree XSLT Requires Tag output?"
The answer is NO.
As shown in the lxml documentation section on XSLT result objects, you can use the standard Python str() function to get the string representation of the transformation result, which is what you need when the result has no root element:
from lxml import etree
raw_xml = '''<projects>
<project>
<name>Shockwave</name>
<language>Ruby</language>
<owner>Brian May</owner>
<state>New</state>
<startDate>31/10/2008 0:00:00</startDate>
</project>
<project>
<name>Other</name>
<language>Erlang</language>
<owner>Takashi Miike</owner>
<state>Canceled</state>
<startDate>07/11/2008 0:00:00</startDate>
</project>
</projects>'''
raw_xslt = '''<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:text>Hi</xsl:text>
</xsl:template>
</xsl:stylesheet>'''
sysroot = etree.fromstring(raw_xml)
xslroot = etree.fromstring(raw_xslt)
transform = etree.XSLT(xslroot)
print(str(transform(sysroot)))
# output:
# Hi
And as you saw, etree.tostring() is still usable when the transformation result has a root element.
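
Building on `str()` of the result, a text-mode stylesheet can emit the CSV directly. The following is a rough sketch rather than a complete converter: the stylesheet, the three chosen columns, and the shortened sample data are illustrative assumptions, and it does no CSV quoting of values that contain commas.

```python
from lxml import etree

# Assumption: a shortened version of the sample data from the question.
raw_xml = """<projects>
  <project><name>Shockwave</name><language>Ruby</language><state>New</state></project>
  <project><name>Other</name><language>Erlang</language><state> Canceled </state></project>
</projects>"""

raw_xslt = """<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/projects">
    <xsl:text>name,language,state&#10;</xsl:text>
    <xsl:for-each select="project">
      <xsl:value-of select="normalize-space(name)"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="normalize-space(language)"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="normalize-space(state)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.fromstring(raw_xslt))
result = transform(etree.fromstring(raw_xml))

# With method="text" there is no root element, so use str() instead of tostring().
csv_text = str(result)
print(csv_text)
```

The text can then be written with an ordinary `open(path, "w")`; no dummy wrapper tag is needed.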

Modify xml using python

I have an xml file already generated by python and it looks like this. It has multiple items.
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>title-name-xyz</title>
<link>http://dist.stage.xyzgauri.com/qa/partner/mac.xml</link>
<description>Most recent changes</description>
<language>en</language>
<item>
<title>Version 3.0.22.4</title>
<sparkle:releaseNotesLink>
https://dist.stage.xyzgauri.com.com/qa/partner/mac_notes.html
</sparkle:releaseNotesLink>
<pubDate>Thu, 12 Nov 2015 04:38:23 -0000</pubDate>
<enclosure
url="https://dist.stage.xyzgauri.com/qa/sandisk/InstallCloud.3.0.22.4.pkg"
sparkle:version="3.0.22.4"
sparkle:shortVersionString="3.0.22"
openlength="30455215"
type="application/octet-stream"
sparkle:dsaSignature="MCwCFHvf7peesvwR0AhRbZxTViLarxcjfd758mHPbnOW6wA=="
sparkle:status="live"
/>
<item>
<title>Version 3.0.10.4</title>
<sparkle:releaseNotesLink>
http://dist.stage.xyzgauri.com/qa/partner/mac_notes.html
</sparkle:releaseNotesLink>
<pubDate>Tue, 03 Nov 2015 04:31:18 -0000</pubDate>
<enclosure
url="http://dist.stage.xyzgauri.com/qa/partner/InstallCloud.3.0.10.4.pkg"
sparkle:version="3.0.10.4"
sparkle:shortVersionString="3.0.10"
openlength="29709636"
type="application/octet-stream"
sparkle:dsaSignature="MCwCFDPvLPr7lYkrx5L5XCDbhXYqrFkGzLtLePK6ng=="
sparkle:status="live"
/>
I need to use python to change the sparkle:status from "live" to "expired" for the older version 3.0.10.4. This xml is later pushed to S3.
I am a newbie to python and hence wondering how to implement this. I can even create a whole new jenkins jobs to get this xml and modify it and then push to S3.
Any help is appreciated.
Thanks.
Consider an XSLT solution using the lxml package, which avoids the looping through all elements that an XPath solution may require. The script here runs an identity transform to copy all nodes and attributes as is, and then applies a template specifically to every @sparkle:status attribute whose sibling attribute in the same element is @sparkle:version='3.0.10.4'. Note, too, that I had to declare the sparkle namespace in the XSLT header.
The code below loads the XSLT script from a string, but you can also parse it from an external file (saved in .xsl or .xslt format), just as you do with your XML file.
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('Input.xml')
xslstr='''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="enclosure[@sparkle:version='3.0.10.4']/@sparkle:status">
<xsl:attribute name="sparkle:status">expired</xsl:attribute>
</xsl:template>
</xsl:transform>'''
xslt = ET.fromstring(xslstr)
# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)
# SAVE OUTPUT
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out.decode("utf-8"))
xmlfile = open('Output.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
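
For comparison, the plain loop that the XSLT route avoids is also short. This sketch uses the standard library's `xml.etree.ElementTree`; `SAMPLE` is a reduced, well-formed stand-in for the feed in the question (an assumption, with only the attributes that matter here), and namespaced attributes are addressed with the `{uri}name` form.

```python
import xml.etree.ElementTree as ET

SPARKLE_NS = "http://www.andymatuschak.org/xml-namespaces/sparkle"
# Keep the "sparkle" prefix when serializing instead of an auto-generated one.
ET.register_namespace("sparkle", SPARKLE_NS)

# Assumption: a reduced, well-formed stand-in for the feed in the question.
SAMPLE = """\
<rss version="2.0" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle">
  <channel>
    <item><enclosure sparkle:version="3.0.22.4" sparkle:status="live"/></item>
    <item><enclosure sparkle:version="3.0.10.4" sparkle:status="live"/></item>
  </channel>
</rss>
"""

VERSION = "{%s}version" % SPARKLE_NS
STATUS = "{%s}status" % SPARKLE_NS

root = ET.fromstring(SAMPLE)
for enclosure in root.iter("enclosure"):
    # Only the older release gets marked as expired.
    if enclosure.get(VERSION) == "3.0.10.4":
        enclosure.set(STATUS, "expired")

statuses = {e.get(VERSION): e.get(STATUS) for e in root.iter("enclosure")}
print(statuses)
```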

Removing all XML elements that belong to specific namespace

I am an XML beginner. I am using lxml python libs to process a SAML document, however my question is not really related to SAML or SSO.
Quite simply, I need to remove all elements in this XML document that belong to the "ds" namespace. I looked at XPath search and at findall(); however, I do not know how to work with namespaces.
The original document looks like this:
<Response IssueInstant="dateandtime" ID="redacted" Version="2.0" xmlns="urn:oasis:names:tc:SAML:2.0:protocol" xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<saml:Issuer>redacted.com</saml:Issuer>
<Status>
<StatusCode Value="urn:oasis:names:tc:SAML:2.0:status:Success"/>
</Status>
<saml:Assertion Version="2.0" IssueInstant="redacted" ID="redacted">
<saml:Issuer>redacted</saml:Issuer>
<ds:Signature>
<ds:SignedInfo>
<ds:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
<ds:SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/>
<ds:Reference URI="#redacted">
<ds:Transforms>
<ds:Transform Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/>
<ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
</ds:Transforms>
<ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
<ds:DigestValue>redacted</ds:DigestValue>
</ds:Reference>
</ds:SignedInfo>
<ds:SignatureValue>redacted==</ds:SignatureValue>
<ds:KeyInfo>
<ds:X509Data>
<ds:X509Certificate>certificateredacted=</ds:X509Certificate>
</ds:X509Data>
<ds:KeyValue>
<ds:RSAKeyValue>
<ds:Modulus>modulusredacted==</ds:Modulus>
<ds:Exponent>AQAB</ds:Exponent>
</ds:RSAKeyValue>
</ds:KeyValue>
</ds:KeyInfo>
</ds:Signature>
<saml:Subject>
<saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">subject_redacted</saml:NameID>
<saml:SubjectConfirmation Method="urn:oasis:names:tc:SAML:2.0:cm:bearer">
<saml:SubjectConfirmationData NotOnOrAfter="date_time_redacted" Recipient="https://website.com/redacted"/>
</saml:SubjectConfirmation>
</saml:Subject>
<saml:Conditions NotOnOrAfter="date_time_redacted" NotBefore="date_time_redacted">
<saml:AudienceRestriction>
<saml:Audience>audience_redacted</saml:Audience>
</saml:AudienceRestriction>
</saml:Conditions>
<saml:AuthnStatement AuthnInstant="date_time_redacted" SessionIndex="date_time_redacted">
<saml:AuthnContext>
<saml:AuthnContextClassRef>urn:oasis:names:tc:SAML:2.0:ac:classes:unspecified</saml:AuthnContextClassRef>
</saml:AuthnContext>
</saml:AuthnStatement>
<saml:AttributeStatement xmlns:xs="http://www.w3.org/2001/XMLSchema">
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">attribute=redacted</saml:AttributeValue>
</saml:Attribute>
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">value_redacted</saml:AttributeValue>
</saml:Attribute>
</saml:AttributeStatement>
</saml:Assertion>
</Response>
What I want is a document that looks like this:
<Response IssueInstant="dateandtime" ID="redacted" Version="2.0" xmlns="urn:oasis:names:tc:SAML:2.0:protocol" xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<saml:Issuer>redacted.com</saml:Issuer>
<Status>
<StatusCode Value="urn:oasis:names:tc:SAML:2.0:status:Success"/>
</Status>
<saml:Assertion Version="2.0" IssueInstant="redacted" ID="redacted">
<saml:Issuer>redacted</saml:Issuer>
<saml:Subject>
<saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">subject_redacted</saml:NameID>
<saml:SubjectConfirmation Method="urn:oasis:names:tc:SAML:2.0:cm:bearer">
<saml:SubjectConfirmationData NotOnOrAfter="date_time_redacted" Recipient="https://website.com/redacted"/>
</saml:SubjectConfirmation>
</saml:Subject>
<saml:Conditions NotOnOrAfter="date_time_redacted" NotBefore="date_time_redacted">
<saml:AudienceRestriction>
<saml:Audience>audience_redacted</saml:Audience>
</saml:AudienceRestriction>
</saml:Conditions>
<saml:AuthnStatement AuthnInstant="date_time_redacted" SessionIndex="date_time_redacted">
<saml:AuthnContext>
<saml:AuthnContextClassRef>urn:oasis:names:tc:SAML:2.0:ac:classes:unspecified</saml:AuthnContextClassRef>
</saml:AuthnContext>
</saml:AuthnStatement>
<saml:AttributeStatement xmlns:xs="http://www.w3.org/2001/XMLSchema">
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">attribute=redacted</saml:AttributeValue>
</saml:Attribute>
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">value_redacted</saml:AttributeValue>
</saml:Attribute>
</saml:AttributeStatement>
</saml:Assertion>
</Response>
You can find elements in a namespace using XPath with //namespace:*, as such:
doc_root.xpath('//ds:*', namespaces={'ds': 'http://www.w3.org/2000/09/xmldsig#'})
Thus, to remove all children in this namespace, you could use something like the following:
def strip_dsig(doc_root):
    nsmap={'ds': 'http://www.w3.org/2000/09/xmldsig#'}
    for element in doc_root.xpath('//ds:*', namespaces=nsmap):
        element.getparent().remove(element)
    return doc_root
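
A possible refinement of the function above is to select only the outermost `ds` elements, since removing a parent takes its descendants with it. This is a sketch with a trimmed stand-in document (`SAMPLE` is an assumption). Note that in lxml, removing an element also removes its tail text, which here is only whitespace.

```python
from lxml import etree

DS_NS = "http://www.w3.org/2000/09/xmldsig#"

# Assumption: a trimmed stand-in for the SAML response in the question.
SAMPLE = """\
<Response xmlns="urn:oasis:names:tc:SAML:2.0:protocol"
          xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
          xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
  <saml:Assertion>
    <saml:Issuer>redacted</saml:Issuer>
    <ds:Signature><ds:SignedInfo><ds:DigestValue>x</ds:DigestValue></ds:SignedInfo></ds:Signature>
    <saml:Subject/>
  </saml:Assertion>
</Response>
"""

root = etree.fromstring(SAMPLE)
nsmap = {"ds": DS_NS}

# Select only ds elements that have no ds ancestor; their subtrees go with them.
for element in root.xpath("//ds:*[not(ancestor::ds:*)]", namespaces=nsmap):
    parent = element.getparent()
    if parent is not None:  # the document root itself cannot be removed this way
        parent.remove(element)

remaining = root.xpath("//ds:*", namespaces=nsmap)
tags = [etree.QName(el).localname for el in root.iter()]
print(len(remaining), tags)
```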
This is very easy to do with an xsl stylesheet. This is probably your best approach.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
exclude-result-prefixes="ds">
<!-- no_ds.xsl -->
<xsl:template match="node()|@*">
<xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>
</xsl:template>
<xsl:template match="ds:*"><xsl:apply-templates select="*"/></xsl:template>
<xsl:template match="@ds:*"/>
</xsl:stylesheet>
You can run this from a command line using xsltproc (for libxml2) or equivalent:
xsltproc -o directoryname/ no_ds.xsl file1.xml file2.xml
This will create directoryname/file1.xml and directoryname/file2.xml without the ds namespace.
You can also do this with lxml, using its libxslt bindings.
from lxml import etree
no_ds_stylesheet = etree.parse('no_ds.xsl')
no_ds_transform = etree.XSLT(no_ds_stylesheet)
# doc_to_transform is an Element or ElementTree
# from etree.fromstring(), etree.XML(), or etree.parse()
no_ds_doc = no_ds_transform(doc_to_transform)
#no_ds_doc is now another ElementTree doc, the result of the XSLT transform.
#You can reuse the no_ds_transform object multiple times (and should if you can)
no_ds_doc2 = no_ds_transform(doc_to_transform2)
Since XSLT documents are XML documents, you can even create a custom XSLT stylesheet on the fly using lxml and define the namespaces you want to omit dynamically. (Left as an exercise for the reader.)
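
One way to attempt that exercise is to fill the namespace URI into a stylesheet template string. This is only a sketch: `make_namespace_stripper`, the `drop` prefix, and the small demo document are hypothetical, and it drops matching elements outright instead of recursing into them.

```python
from xml.sax.saxutils import quoteattr

from lxml import etree

# The {namespace} placeholder is replaced with a quoted, escaped attribute value.
XSLT_TEMPLATE = """\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:drop={namespace}
    exclude-result-prefixes="drop">
  <xsl:template match="node()|@*">
    <xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>
  </xsl:template>
  <xsl:template match="drop:*"/>
  <xsl:template match="@drop:*"/>
</xsl:stylesheet>
"""


def make_namespace_stripper(namespace_uri):
    """Build an XSLT transform that drops elements and attributes in namespace_uri."""
    stylesheet = etree.fromstring(
        XSLT_TEMPLATE.format(namespace=quoteattr(namespace_uri))
    )
    return etree.XSLT(stylesheet)


# Assumption: a tiny demo document, not the SAML response from the question.
doc = etree.fromstring(
    '<a xmlns:ds="http://www.w3.org/2000/09/xmldsig#" ds:flag="1">'
    "<keep/><ds:Signature><ds:Inner/></ds:Signature></a>"
)

strip_ds = make_namespace_stripper("http://www.w3.org/2000/09/xmldsig#")
cleaned = strip_ds(doc).getroot()
print(etree.tostring(cleaned))
```

The returned transform can be reused for many documents, in line with the advice above.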
