Does the XPath collection() function work with lxml and XSLT? - python

I recently tried to transform an XML file with the lxml package and an XSL stylesheet containing a variable that uses the XPath collection() function. However, I get the following error when I run my code:
lxml.etree.XSLTApplyError: Failed to evaluate the expression of variable 'name'.
Here are the details of my files:
XML source : catalog.xml
<?xml version="1.0" encoding="UTF-8"?>
<collection>
<doc href="./IR_041698.xml"/>
<doc href="./IR_051379.xml"/>
</collection>
XSL file : test.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:tei="http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="tei">
<xsl:strip-space elements="*"/>
<xsl:output indent="yes" method="xml"/>
<xsl:template match="/">
<xsl:variable name="name" select="collection('catalog.xml')/descendant::archdesc/did/origination/persname/text()"/>
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
<fileDesc>
<titleStmt>
<title>
<xsl:value-of select="$name"/>
</title>
</titleStmt>
</fileDesc>
</teiHeader>
</xsl:template>
</xsl:stylesheet>
Python code :
from lxml import etree as ET
source = ET.parse("catalog.xml")
xslt = ET.parse("test.xsl")
transform = ET.XSLT(xslt)
newdom = transform(source)
print(ET.tostring(newdom, pretty_print=True))
I am a little surprised because the transformation works when I run it in the Oxygen XML editor, but not in Python.
Do you have any suggestions? Is the XPath collection() function a problem with lxml?
Thank you in advance.

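The collection function is part of XPath 2.0 and XSLT 2.0 and later, and as such is not supported by lxml, which implements XSLT 1.0 and XPath 1.0. In XSLT you can, however, use the document function as document(document('catalog.xml')/*/doc/@href) to select the "collection" of documents referenced by the href attributes of the doc elements in catalog.xml. In the stylesheet above, that expression would take the place of collection('catalog.xml') at the start of the variable's select.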
Saxon 9.9 is also available as a Python module https://www.saxonica.com/saxon-c/doc/html/saxonc.html as part of Saxon C 1.2.1 (Download http://saxonica.com/download/c.xml, documentation: http://www.saxonica.com/saxon-c/documentation/index.html) so you might consider switching from lxml to Saxon-C if you want to use XSLT 3 in Python.
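If the gathering step does not have to happen inside the stylesheet, the catalog can also be walked in plain Python. The following is only a sketch of that alternative, using the standard library rather than XSLT or the document() approach described above. The contents of the two IR_*.xml files and the helper name collect_persnames are made up for illustration; only the catalog layout and the archdesc/did/origination/persname path come from the question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def collect_persnames(catalog_path):
    """Return the persname text from every document listed in the catalog."""
    # Resolve each href relative to the catalog file, as document() would.
    base_dir = os.path.dirname(os.path.abspath(catalog_path))
    names = []
    for doc in ET.parse(catalog_path).getroot().iter("doc"):
        tree = ET.parse(os.path.join(base_dir, doc.get("href")))
        for persname in tree.iterfind(".//archdesc/did/origination/persname"):
            names.append(persname.text)
    return names


# Hypothetical sample files, written to a temporary directory for the demo.
DOC = ("<ead><archdesc><did><origination><persname>%s</persname>"
       "</origination></did></archdesc></ead>")
FILES = {
    "catalog.xml": '<collection><doc href="./IR_041698.xml"/>'
                   '<doc href="./IR_051379.xml"/></collection>',
    "IR_041698.xml": DOC % "Alice Example",
    "IR_051379.xml": DOC % "Bob Example",
}

with tempfile.TemporaryDirectory() as tmp:
    for filename, text in FILES.items():
        with open(os.path.join(tmp, filename), "w", encoding="utf-8") as f:
            f.write(text)
    names = collect_persnames(os.path.join(tmp, "catalog.xml"))

print(names)
```

Real finding aids may put archdesc in a namespace, in which case the path would need namespace-qualified tag names.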

Related

Putting Namespaces into Different XML Tags in Python

I have an xml file in tmp/Program.ev3p:
<?xml version="1.0" encoding="utf-8"?>
<SourceFile Version="1.0.2.10" xmlns="http://www.ni.com/SourceModel.xsd">
<Namespace Name="Project">
<VirtualInstrument IsTopLevel="false" IsReentrant="false" Version="1.0.2.0" OverridingModelDefinitionType="X3VIDocument" xmlns="http://www.ni.com/VirtualInstrument.xsd">
<FrontPanel>
<fpruntime:FrontPanelCanvas xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" xmlns:fpruntime="clr-namespace:NationalInstruments.LabVIEW.FrontPanelRuntime;assembly=NationalInstruments.LabVIEW.FrontPanelRuntime" xmlns:Model="clr-namespace:NationalInstruments.SourceModel.Designer;assembly=NationalInstruments.SourceModel" x:Name="FrontPanel" Model:DesignerSurfaceProperties.CanSnapToObjects="True" Model:DesignerSurfaceProperties.SnapToObjects="True" Model:DesignerSurfaceProperties.ShowSnaplines="True" Model:DesignerSurfaceProperties.ShowControlAdorners="True" Width="640" Height="480" />
</FrontPanel>
<BlockDiagram Name="__RootDiagram__">
<StartBlock Id="n1" Bounds="0 0 70 91" Target="X3\.Lib:StartBlockTest">
<ConfigurableMethodTerminal>
<Terminal Id="Result" Direction="Output" DataType="Boolean" Hotspot="0.5 1" Bounds="0 0 0 0" />
</ConfigurableMethodTerminal>
<Terminal Id="SequenceOut" Direction="Output" DataType="NationalInstruments:SourceModel:DataTypes:X3SequenceWireDataType" Hotspot="1 0.5" Bounds="52 33 18 18" />
</StartBlock>
</BlockDiagram>
</VirtualInstrument>
</Namespace>
</SourceFile>
I am trying to modify it with the following code:
import xml.etree.ElementTree as ET
tree = ET.parse('tmp/Program.ev3p')
root = tree.getroot()
namespaces = {'http://www.ni.com/SourceModel.xsd': '' ,
'http://www.ni.com/VirtualInstrument.xsd':'',
'http://schemas.microsoft.com/winfx/2006/xaml/presentation':'',
'http://schemas.microsoft.com/winfx/2006/xaml':'x',
'clr-namespace:NationalInstruments.LabVIEW.FrontPanelRuntime;assembly=NationalInstruments.LabVIEW.FrontPanelRuntime':'fpruntime',
'clr-namespace:NationalInstruments.SourceModel.Designer;assembly=NationalInstruments.SourceModel': 'Model',
}
for uri, prefix in namespaces.items():
    ET._namespace_map[uri] = prefix
diagram = root[0][0][1]
elem = ET.Element('Data')
diagram.append(elem)
tree.write('tmp/Program.ev3p',"UTF-8",xml_declaration=True)
After running the code, my xml file contains:
<?xml version='1.0' encoding='UTF-8'?>
<SourceFile xmlns="http://www.ni.com/SourceModel.xsd" xmlns="http://www.ni.com/VirtualInstrument.xsd" xmlns:Model="clr-namespace:NationalInstruments.SourceModel.Designer;assembly=NationalInstruments.SourceModel" xmlns:fpruntime="clr-namespace:NationalInstruments.LabVIEW.FrontPanelRuntime;assembly=NationalInstruments.LabVIEW.FrontPanelRuntime" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" Version="1.0.2.10">
<Namespace Name="Project">
<VirtualInstrument IsReentrant="false" IsTopLevel="false" OverridingModelDefinitionType="X3VIDocument" Version="1.0.2.0">
<FrontPanel>
<fpruntime:FrontPanelCanvas Height="480" Width="640" Model:DesignerSurfaceProperties.CanSnapToObjects="True" Model:DesignerSurfaceProperties.ShowControlAdorners="True" Model:DesignerSurfaceProperties.ShowSnaplines="True" Model:DesignerSurfaceProperties.SnapToObjects="True" x:Name="FrontPanel" />
</FrontPanel>
<BlockDiagram Name="__RootDiagram__">
<StartBlock Bounds="0 0 70 91" Id="n1" Target="X3\.Lib:StartBlockTest">
<ConfigurableMethodTerminal>
<Terminal Bounds="0 0 0 0" DataType="Boolean" Direction="Output" Hotspot="0.5 1" Id="Result" />
</ConfigurableMethodTerminal>
<Terminal Bounds="52 33 18 18" DataType="NationalInstruments:SourceModel:DataTypes:X3SequenceWireDataType" Direction="Output" Hotspot="1 0.5" Id="SequenceOut" />
</StartBlock>
<Data /></BlockDiagram>
</VirtualInstrument>
</Namespace>
</SourceFile>
I need each namespace declaration to stay on the tag where it was declared in the original file, instead of all of them being moved to SourceFile. Is that possible in Python?
The undocumented ElementTree._namespace_map[uri] = prefix is the older way (ElementTree < 1.3) of assigning namespace prefixes; the current, documented equivalent is ElementTree.register_namespace(prefix, uri). But even that method does not resolve the root issue, and the docs emphasize that the assignment applies globally and replaces any previous namespace or prefix:
xml.etree.ElementTree.register_namespace(prefix, uri)
Registers a namespace prefix. The registry is global, and any existing
mapping for either the given prefix or the namespace URI will be
removed. prefix is a namespace prefix. uri is a namespace uri. Tags
and attributes in this namespace will be serialized with the given
prefix, if at all possible.
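The effect of the registry shows up only at serialization time. As a minimal sketch with the standard library (the Panel and Canvas element names are made up for illustration), the following registers one prefix and serializes a tree in which only a nested child uses that namespace:

```python
import xml.etree.ElementTree as ET

XAML_X = "http://schemas.microsoft.com/winfx/2006/xaml"

# Register the prefix globally; this only influences how trees are written.
ET.register_namespace("x", XAML_X)

# Hypothetical tree: the namespace is used by the nested child only.
root = ET.Element("Panel")
ET.SubElement(root, "{%s}Canvas" % XAML_X)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The xmlns:x declaration ends up on the outermost element being written, not on the child that uses it, which is the same hoisting seen in the question's output.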
To achieve your desired result, and because your XML is fairly complex with multiple default and non-default namespaces, consider XSLT, the special-purpose language for transforming XML files. Python can run XSLT 1.0 scripts with the third-party module lxml (not the built-in etree). Additionally, XSLT is portable, so the very same code can run in other languages (Java, C#, PHP, VB) and dedicated processors (e.g., Saxon, Xalan).
Specifically, you can use a temporary prefix like doc to map the default namespace of lowest level parent, VirtualInstrument and use this prefix to identify the needed nodes. All other elements are copied over as is with the identity transform template. Also, because you are adding an element to the default namespace you can assign it with the xsl:element tag.
XSLT (save below as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://www.ni.com/VirtualInstrument.xsd">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- IDENTITY TRANSFORM -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="doc:BlockDiagram">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
<xsl:element name="Data" namespace="http://www.ni.com/VirtualInstrument.xsd"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
# TRANSFORM INPUT
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT RESULT TREE TO CONSOLE
print(newdom)
# SAVE RESULT TREE AS XML
with open('Output.xml','wb') as f:
    f.write(newdom)

Parsing an xml file using lxml

I'm trying to edit an xml file by finding each Watts tag and changing the text in it. So far I've managed to change all tags, but not the Watts tag specifically.
My parser is:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "Watts":
        watt.text = "strong"
tree.write("output.xml")
This keeps my cycling.xml file unchanged. A snippet from output.xml (which is also the cycling.xml file since this is unchanged) is:
<TrainingCenterDatabase xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
<Activities>
<Activity Sport="Biking">
<Id>2018-05-06T20:49:56Z</Id>
<Lap StartTime="2018-05-06T20:49:56Z">
<TotalTimeSeconds>2495.363</TotalTimeSeconds>
<DistanceMeters>15345</DistanceMeters>
<MaximumSpeed>18.4</MaximumSpeed>
<Calories>0</Calories>
<Intensity>Active</Intensity>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2018-05-06T20:49:56Z</Time>
<Position>
<LatitudeDegrees>49.319297</LatitudeDegrees>
<LongitudeDegrees>-123.024128</LongitudeDegrees>
</Position>
<HeartRateBpm>
<Value>99</Value>
</HeartRateBpm>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>0</Watts>
<Speed>2</Speed>
</TPX>
</Extensions>
</Trackpoint>
If I change my parser to change all tags with:
for watt in root.iter():
    if watt.tag != "Watts":
        watt.text = "strong"
Then my output.xml file becomes:
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">strong<Activities>strong<Activity Sport="Biking">strong<Id>strong</Id>
<Lap StartTime="2018-05-06T20:49:56Z">strong<TotalTimeSeconds>strong</TotalTimeSeconds>
<DistanceMeters>strong</DistanceMeters>
<MaximumSpeed>strong</MaximumSpeed>
<Calories>strong</Calories>
<Intensity>strong</Intensity>
<TriggerMethod>strong</TriggerMethod>
<Track>strong<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<AltitudeMeters>strong</AltitudeMeters>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
How can I change just the Watts tag?
I don't understand what the root = tree.getroot() does. I just thought I'd ask this question at the same time, although I'm not sure it matters in my particular problem.
Your document defines a default XML namespace. Look at the xmlns= attribute at the end of the opening tag:
<TrainingCenterDatabase
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
This means there is no element named simply "Watts" in your document; you will need to qualify tag names with the appropriate namespace. If you print out the value of watt.tag in your loop, you will see:
$ python filter.py
{http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}TrainingCenterDatabase
[...]
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Speed
With this in mind, you can modify your filter so that it looks like this:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts":
        watt.text = "strong"
tree.write("output.xml")
You can read more about namespace handling in the lxml documentation.
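The same qualified name can also be passed straight to iter(), so that only the matching elements are visited. Here is a small sketch of that variant using the standard library's ElementTree (lxml's iter() accepts a tag argument in the same form); the inline sample is a trimmed-down stand-in for cycling.xml:

```python
import xml.etree.ElementTree as ET

TPX_NS = "http://www.garmin.com/xmlschemas/ActivityExtension/v2"

# Trimmed-down stand-in for cycling.xml, for illustration only.
SAMPLE = """<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Extensions>
    <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
      <Watts>0</Watts>
      <Speed>2</Speed>
    </TPX>
  </Extensions>
</TrainingCenterDatabase>"""

root = ET.fromstring(SAMPLE)

# iter() with a "{namespace}tag" argument yields only the Watts elements.
for watts in root.iter("{%s}Watts" % TPX_NS):
    watts.text = "strong"

watts_values = [el.text for el in root.iter("{%s}Watts" % TPX_NS)]
speed_values = [el.text for el in root.iter("{%s}Speed" % TPX_NS)]
print(watts_values, speed_values)
```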
Alternatively, since your goal is to edit XML and you are already using lxml, consider XSLT (the XML transformation language), where you can define a namespace prefix and change Watts anywhere in the document without looping. Plus, you can pass values into XSLT from Python!
XSLT (save as .xsl file)
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- VALUE TO BE PASSED INTO FROM PYTHON -->
<xsl:param name="python_value"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- ADJUST WATTS TEXT -->
<xsl:template match="doc:Watts">
<xsl:copy><xsl:value-of select="$python_value"/></xsl:copy>
</xsl:template>
</xsl:transform>
Python
from lxml import etree
# LOAD XML AND XSL
doc = etree.parse("cycling.xml")
xsl = etree.parse('XSLT_Script.xsl')
# CONFIGURE TRANSFORMER
transform = etree.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = etree.XSLT.strparam('Strong')
result = transform(doc, python_value=n)
# PRINT TO CONSOLE
print(result)
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Pull out sections from XML in Python

Please note that I have some Python experience but not a lot of deep experience so please bear with me.
I have a very large XML file, ~100 megs, that has many, many sections and subsections. I need to pull out each subsection of a certain type (and there are a lot with this type) and write each to a different file. The writing I can handle, but I'm staring at ElementTree documentation trying to make sense of how to traverse the tree, find an element declared this way, yank out just the data between those tags and process it, then continue down the file.
The structure is similar to this (slightly obfuscated). What I want to do is pull out each section labeled "data" individually.
<filename>
<config>
<collections>
<datas>
<data>
...
</data>
<data>
...
</data>
<data>
...
</data>
</datas>
</collections>
</config>
</filename>
I think you can read in each data element using iterparse and then write it out. The following simply prints each element with the print function, but you could of course write it to a file instead:
import xml.etree.ElementTree as ET
for event, elem in ET.iterparse("input.xml"):
    if elem.tag == 'data':
        print(ET.tostring(elem, 'UTF-8', 'xml'))
        elem.clear()
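To write each data element to its own file instead of printing it, the same iterparse loop can be wrapped as below. This is only a sketch: the function name split_data_sections, the data_N.xml naming scheme, and the tiny inline sample are assumptions for illustration, not part of the original question.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET


def split_data_sections(source, out_dir):
    """Write every <data> element found in source to its own file."""
    paths = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "data":
            path = os.path.join(out_dir, "data_%d.xml" % (len(paths) + 1))
            ET.ElementTree(elem).write(path, encoding="utf-8",
                                       xml_declaration=True)
            paths.append(path)
            # Free the element's children once it has been written.
            elem.clear()
    return paths


# Tiny stand-in for the large input file.
SAMPLE = (b"<filename><config><collections><datas>"
          b"<data><v>1</v></data><data><v>2</v></data>"
          b"</datas></collections></config></filename>")

with tempfile.TemporaryDirectory() as out_dir:
    written = split_data_sections(io.BytesIO(SAMPLE), out_dir)
    contents = [ET.parse(p).getroot().findtext("v") for p in written]

print(contents)
```

For a real file, pass its path as source in place of the BytesIO object.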
Consider an XSLT solution with Python's third-party module, lxml. Specifically, you use xpath() to count the <data> nodes and then iteratively build a dynamic XSLT script that selects only the needed element by position index, writing each one to its own XML file:
import lxml.etree as et
dom = et.parse('Input.xml')
datalen = len(dom.xpath("//data"))
for i in range(1, datalen+1):
    # Build a stylesheet that keeps only the i-th <data> element
    xsltstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="datas">
<xsl:apply-templates select="data[{0}]" />
</xsl:template>
<xsl:template match="data[{0}]">
<xsl:copy>
<xsl:copy-of select="*"/>
</xsl:copy>
</xsl:template>
</xsl:transform>'''.format(i)
    xslt = et.fromstring(xsltstr)
    transform = et.XSLT(xslt)
    newdom = transform(dom)
    tree_out = et.tostring(newdom, encoding='UTF-8', pretty_print=True,
                           xml_declaration=True)
    # Write each result to its own numbered file
    with open('Data{}.xml'.format(i), 'wb') as xmlfile:
        xmlfile.write(tree_out)

Modify xml using python

I have an xml file already generated by python and it looks like this. It has multiple items.
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>title-name-xyz</title>
<link>http://dist.stage.xyzgauri.com/qa/partner/mac.xml</link>
<description>Most recent changes</description>
<language>en</language>
<item>
<title>Version 3.0.22.4</title>
<sparkle:releaseNotesLink>
https://dist.stage.xyzgauri.com.com/qa/partner/mac_notes.html
</sparkle:releaseNotesLink>
<pubDate>Thu, 12 Nov 2015 04:38:23 -0000</pubDate>
<enclosure
url="https://dist.stage.xyzgauri.com/qa/sandisk/InstallCloud.3.0.22.4.pkg"
sparkle:version="3.0.22.4"
sparkle:shortVersionString="3.0.22"
openlength="30455215"
type="application/octet-stream"
sparkle:dsaSignature="MCwCFHvf7peesvwR0AhRbZxTViLarxcjfd758mHPbnOW6wA=="
sparkle:status="live"
/>
<item>
<title>Version 3.0.10.4</title>
<sparkle:releaseNotesLink>
http://dist.stage.xyzgauri.com/qa/partner/mac_notes.html
</sparkle:releaseNotesLink>
<pubDate>Tue, 03 Nov 2015 04:31:18 -0000</pubDate>
<enclosure
url="http://dist.stage.xyzgauri.com/qa/partner/InstallCloud.3.0.10.4.pkg"
sparkle:version="3.0.10.4"
sparkle:shortVersionString="3.0.10"
openlength="29709636"
type="application/octet-stream"
sparkle:dsaSignature="MCwCFDPvLPr7lYkrx5L5XCDbhXYqrFkGzLtLePK6ng=="
sparkle:status="live"
/>
I need to use python to change the sparkle:status from "live" to "expired" for the older version 3.0.10.4. This xml is later pushed to S3.
I am a newbie to python and hence wondering how to implement this. I can even create a whole new jenkins jobs to get this xml and modify it and then push to S3.
Any help is appreciated.
Thanks.
Consider an XSLT solution using the lxml package, which avoids the explicit loop over elements that an XPath solution may require. The script here runs an identity transform to copy all nodes and attributes as is, and then applies a template specifically to every @sparkle:status attribute whose parent enclosure element also has @sparkle:version='3.0.10.4'. Note too that the sparkle namespace has to be declared in the XSLT's header.
The code below loads the XSLT script from a string, but you can also parse it from an external file (saved as .xsl or .xslt), as you do with your XML file.
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('Input.xml')
xslstr='''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="enclosure[@sparkle:version='3.0.10.4']/@sparkle:status">
<xsl:attribute name="sparkle:status">expired</xsl:attribute>
</xsl:template>
</xsl:transform>'''
xslt = ET.fromstring(xslstr)
# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)
# SAVE OUTPUT
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out.decode("utf-8"))
xmlfile = open('Output.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
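For comparison, the explicit-loop alternative mentioned above might look like the following. It is a rough sketch with the standard library's ElementTree, not the XSLT method of this answer; the function name expire_version and the shortened inline feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

SPARKLE_NS = "http://www.andymatuschak.org/xml-namespaces/sparkle"
VERSION = "{%s}version" % SPARKLE_NS
STATUS = "{%s}status" % SPARKLE_NS

# Keep the sparkle prefix when the tree is written back out.
ET.register_namespace("sparkle", SPARKLE_NS)


def expire_version(root, version):
    """Set sparkle:status to 'expired' on enclosures of the given version."""
    changed = 0
    for enclosure in root.iter("enclosure"):
        if enclosure.get(VERSION) == version:
            enclosure.set(STATUS, "expired")
            changed += 1
    return changed


# Shortened stand-in for the feed in the question.
FEED = """<rss version="2.0" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle">
<channel>
<item><enclosure sparkle:version="3.0.22.4" sparkle:status="live"/></item>
<item><enclosure sparkle:version="3.0.10.4" sparkle:status="live"/></item>
</channel>
</rss>"""

root = ET.fromstring(FEED)
changed = expire_version(root, "3.0.10.4")
statuses = {e.get(VERSION): e.get(STATUS) for e in root.iter("enclosure")}
print(changed, statuses)
```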

Removing all XML elements that belong to specific namespace

I am an XML beginner. I am using lxml python libs to process a SAML document, however my question is not really related to SAML or SSO.
Quite Simply I need to remove all elements that exist in this XML document which belong to the "ds" namespace. I looked at an Xpath Search, I looked at findall() however I do not know how to work with namespaces.
The original document looks like this:
<Response IssueInstant="dateandtime" ID="redacted" Version="2.0" xmlns="urn:oasis:names:tc:SAML:2.0:protocol" xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<saml:Issuer>redacted.com</saml:Issuer>
<Status>
<StatusCode Value="urn:oasis:names:tc:SAML:2.0:status:Success"/>
</Status>
<saml:Assertion Version="2.0" IssueInstant="redacted" ID="redacted">
<saml:Issuer>redacted</saml:Issuer>
<ds:Signature>
<ds:SignedInfo>
<ds:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
<ds:SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/>
<ds:Reference URI="#redacted">
<ds:Transforms>
<ds:Transform Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/>
<ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
</ds:Transforms>
<ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
<ds:DigestValue>redacted</ds:DigestValue>
</ds:Reference>
</ds:SignedInfo>
<ds:SignatureValue>redacted==</ds:SignatureValue>
<ds:KeyInfo>
<ds:X509Data>
<ds:X509Certificate>certificateredacted=</ds:X509Certificate>
</ds:X509Data>
<ds:KeyValue>
<ds:RSAKeyValue>
<ds:Modulus>modulusredacted==</ds:Modulus>
<ds:Exponent>AQAB</ds:Exponent>
</ds:RSAKeyValue>
</ds:KeyValue>
</ds:KeyInfo>
</ds:Signature>
<saml:Subject>
<saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">subject_redacted</saml:NameID>
<saml:SubjectConfirmation Method="urn:oasis:names:tc:SAML:2.0:cm:bearer">
<saml:SubjectConfirmationData NotOnOrAfter="date_time_redacted" Recipient="https://website.com/redacted"/>
</saml:SubjectConfirmation>
</saml:Subject>
<saml:Conditions NotOnOrAfter="date_time_redacted" NotBefore="date_time_redacted">
<saml:AudienceRestriction>
<saml:Audience>audience_redacted</saml:Audience>
</saml:AudienceRestriction>
</saml:Conditions>
<saml:AuthnStatement AuthnInstant="date_time_redacted" SessionIndex="date_time_redacted">
<saml:AuthnContext>
<saml:AuthnContextClassRef>urn:oasis:names:tc:SAML:2.0:ac:classes:unspecified</saml:AuthnContextClassRef>
</saml:AuthnContext>
</saml:AuthnStatement>
<saml:AttributeStatement xmlns:xs="http://www.w3.org/2001/XMLSchema">
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">attribute=redacted</saml:AttributeValue>
</saml:Attribute>
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">value_redacted</saml:AttributeValue>
</saml:Attribute>
</saml:AttributeStatement>
</saml:Assertion>
</Response>
What I want is a document that looks like this:
<Response IssueInstant="dateandtime" ID="redacted" Version="2.0" xmlns="urn:oasis:names:tc:SAML:2.0:protocol" xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<saml:Issuer>redacted.com</saml:Issuer>
<Status>
<StatusCode Value="urn:oasis:names:tc:SAML:2.0:status:Success"/>
</Status>
<saml:Assertion Version="2.0" IssueInstant="redacted" ID="redacted">
<saml:Issuer>redacted</saml:Issuer>
<saml:Subject>
<saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified">subject_redacted</saml:NameID>
<saml:SubjectConfirmation Method="urn:oasis:names:tc:SAML:2.0:cm:bearer">
<saml:SubjectConfirmationData NotOnOrAfter="date_time_redacted" Recipient="https://website.com/redacted"/>
</saml:SubjectConfirmation>
</saml:Subject>
<saml:Conditions NotOnOrAfter="date_time_redacted" NotBefore="date_time_redacted">
<saml:AudienceRestriction>
<saml:Audience>audience_redacted</saml:Audience>
</saml:AudienceRestriction>
</saml:Conditions>
<saml:AuthnStatement AuthnInstant="date_time_redacted" SessionIndex="date_time_redacted">
<saml:AuthnContext>
<saml:AuthnContextClassRef>urn:oasis:names:tc:SAML:2.0:ac:classes:unspecified</saml:AuthnContextClassRef>
</saml:AuthnContext>
</saml:AuthnStatement>
<saml:AttributeStatement xmlns:xs="http://www.w3.org/2001/XMLSchema">
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">attribute=redacted</saml:AttributeValue>
</saml:Attribute>
<saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified" Name="attribute_name_redacted">
<saml:AttributeValue xsi:type="xs:string">value_redacted</saml:AttributeValue>
</saml:Attribute>
</saml:AttributeStatement>
</saml:Assertion>
</Response>
You can find elements in a namespace using XPath with //namespace:*, as such:
doc_root.xpath('//ds:*', namespaces={'ds': 'http://www.w3.org/2000/09/xmldsig#'})
Thus, to remove all children in this namespace, you could use something like the following:
def strip_dsig(doc_root):
    nsmap={'ds': 'http://www.w3.org/2000/09/xmldsig#'}
    for element in doc_root.xpath('//ds:*', namespaces=nsmap):
        element.getparent().remove(element)
    return doc_root
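Where lxml's xpath() and getparent() are not available, a similar result can be sketched with the standard library by walking the tree from each parent. This is an assumed variant, not the lxml code above; the function name strip_namespace and the shortened sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

DS_NS = "http://www.w3.org/2000/09/xmldsig#"


def strip_namespace(parent, uri):
    """Remove every descendant element in the given namespace.

    Returns the number of subtrees that were detached.
    """
    prefix = "{%s}" % uri
    removed = 0
    for child in list(parent):  # copy, since we modify parent while looping
        if isinstance(child.tag, str) and child.tag.startswith(prefix):
            parent.remove(child)
            removed += 1
        else:
            removed += strip_namespace(child, uri)
    return removed


# Shortened stand-in for the SAML response in the question.
SAMPLE = """<Response xmlns="urn:oasis:names:tc:SAML:2.0:protocol"
    xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
    xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
  <saml:Assertion>
    <saml:Issuer>redacted</saml:Issuer>
    <ds:Signature><ds:SignatureValue>redacted==</ds:SignatureValue></ds:Signature>
    <saml:Subject/>
  </saml:Assertion>
</Response>"""

root = ET.fromstring(SAMPLE)
removed = strip_namespace(root, DS_NS)
remaining = [el.tag for el in root.iter() if el.tag.startswith("{%s}" % DS_NS)]
print(removed, remaining)
```

Removing an element this way also discards the text that followed it (its tail), which is usually just whitespace in documents like this one.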
This is very easy to do with an xsl stylesheet. This is probably your best approach.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
exclude-result-prefixes="ds">
<!-- no_ds.xsl -->
<xsl:template match="node()|@*">
<xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>
</xsl:template>
<xsl:template match="ds:*"><xsl:apply-templates select="*"/></xsl:template>
<xsl:template match="@ds:*"/>
</xsl:stylesheet>
You can run this from a command line using xsltproc (for libxml2) or equivalent:
xsltproc -o directoryname/ no_ds.xsl file1.xml file2.xml
This will create directoryname/file1.xml and directoryname/file2.xml without the ds namespace.
You can also do this with lxml using lxml's libxslt bindings.
from lxml import etree
no_ds_stylesheet = etree.parse('no_ds.xsl')
no_ds_transform = etree.XSLT(no_ds_stylesheet)
# doc_to_transform is an Element or ElementTree
# from etree.fromstring(), etree.XML(), or etree.parse()
no_ds_doc = no_ds_transform(doc_to_transform)
#no_ds_doc is now another ElementTree doc, the result of the XSLT transform.
#You can reuse the no_ds_transform object multiple times (and should if you can)
no_ds_doc2 = no_ds_transform(doc_to_transform2)
Since XSLT documents are XML documents, you can even create a custom XSLT stylesheet on the fly using lxml and define the namespaces you want to omit dynamically. (Left as an exercise for the reader.)
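One possible shape for that exercise is sketched below. It only builds the stylesheet text with the standard library, and it is an assumption about how one might do it rather than the answer's own code. The function name build_strip_stylesheet and the generated strip0, strip1, ... prefixes are hypothetical. Unlike no_ds.xsl above, this version drops matching elements together with everything inside them.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def build_strip_stylesheet(namespace_uris):
    """Return XSLT 1.0 text that removes nodes in the given namespaces."""
    declarations = []
    templates = []
    for index, uri in enumerate(namespace_uris):
        prefix = "strip%d" % index  # hypothetical generated prefix
        declarations.append("xmlns:%s=%s" % (prefix, quoteattr(uri)))
        # Empty template: matching elements and attributes are dropped.
        templates.append('<xsl:template match="%s:*|@%s:*"/>'
                         % (prefix, prefix))
    return (
        '<xsl:stylesheet version="1.0" xmlns:xsl="%s" %s>'
        '<xsl:template match="node()|@*">'
        '<xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>'
        '</xsl:template>%s</xsl:stylesheet>'
    ) % (XSL_NS, " ".join(declarations), "".join(templates))


stylesheet_text = build_strip_stylesheet(["http://www.w3.org/2000/09/xmldsig#"])

# Sanity check: the generated text is well-formed XML with two templates.
parsed = ET.fromstring(stylesheet_text)
template_matches = [t.get("match")
                    for t in parsed.findall("{%s}template" % XSL_NS)]
print(template_matches)
```

With lxml, the resulting text could then be compiled with etree.XSLT(etree.fromstring(stylesheet_text)) and applied like no_ds_transform above.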
