Python etree XSLT Requires Tag output? - python

I'm trying to make a simple XML --> CSV script, using XSLT. I found that etree seems to "want" a tag to output... Does anyone know a workaround? Yes, I've seen this post: XML to CSV Using XSLT.
See below...
Here's some sample XML data, just for reference. My code doesn't do anything with the data yet, because it was failing to write even a header.
<projects>
<project>
<name>Shockwave</name>
<language>Ruby</language>
<owner>Brian May</owner>
<state>New</state>
<startDate>31/10/2008 0:00:00</startDate>
</project>
<project>
<name>Other</name>
<language>Erlang</language>
<owner>Takashi Miike</owner>
<state> Canceled </state>
<startDate>07/11/2008 0:00:00</startDate>
</project>
</projects>
Here's my script:
import sys
from lxml import etree
system_file = sys.argv[1]
xml_file = sys.argv[2]
sys_txt = open( system_file,"r" ).read()
xsl_txt = open( "csv_file.xslt","r" ).read()
sysroot = etree.fromstring( sys_txt )
xslroot = etree.fromstring( xsl_txt )
transform = etree.XSLT( xslroot )
with open( xml_file, "w" ) as f:
    f.write(etree.tostring( transform(sysroot) ) )
This XSLT code does NOT work ( etree.tostring... = None ):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
Hi
</xsl:template>
</xsl:stylesheet>
But THIS XSLT does work... seems etree needs to output an XML file?
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<dummy>
Hi
</dummy>
</xsl:template>
</xsl:stylesheet>
At this point I'm thinking I can proceed with a dummy tag, then remove it at end...

"Python etree XSLT Requires Tag output?"
The answer is NO.
As the lxml documentation shows in its section on XSLT result objects, you can call the standard Python str() function on the transformation result to get its string representation. This is the way to go when the result has no root element, as with text output:
from lxml import etree
raw_xml = '''<projects>
<project>
<name>Shockwave</name>
<language>Ruby</language>
<owner>Brian May</owner>
<state>New</state>
<startDate>31/10/2008 0:00:00</startDate>
</project>
<project>
<name>Other</name>
<language>Erlang</language>
<owner>Takashi Miike</owner>
<state>Canceled</state>
<startDate>07/11/2008 0:00:00</startDate>
</project>
</projects>'''
raw_xslt = '''<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:text>Hi</xsl:text>
</xsl:template>
</xsl:stylesheet>'''
sysroot = etree.fromstring(raw_xml)
xslroot = etree.fromstring(raw_xslt)
transform = etree.XSLT(xslroot)
print(str(transform(sysroot)))
# output:
# Hi
And as you saw, etree.tostring() is still usable when the transformation result has a root element.
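For the XML to CSV goal in the question, the same str() approach can produce the whole file content. The following is a minimal sketch, assuming only the name and language columns are wanted; the helper name xml_to_csv and the trimmed sample data are illustrative and not part of the original post.

```python
from lxml import etree

# Trimmed version of the sample data from the question
XML = b"""<projects>
  <project><name>Shockwave</name><language>Ruby</language></project>
  <project><name>Other</name><language>Erlang</language></project>
</projects>"""

# Text-mode stylesheet: one header line, then one line per <project>
XSLT = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:text>name,language&#10;</xsl:text>
    <xsl:for-each select="projects/project">
      <xsl:value-of select="name"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="language"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""


def xml_to_csv(xml_bytes, xslt_bytes):
    """Hypothetical helper: apply a text-mode stylesheet and return the text."""
    transform = etree.XSLT(etree.fromstring(xslt_bytes))
    result = transform(etree.fromstring(xml_bytes))
    # str() works even though the result has no root element
    return str(result)


csv_text = xml_to_csv(XML, XSLT)
print(csv_text)
```

The returned string can then be written with an ordinary text-mode open(), which avoids etree.tostring() altogether. Note that this sketch does not quote values containing commas.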

Related

Does the XPath collection () function work with lxml and XSLT?

I recently tried to transform an XML file with the lxml package and an XSL stylesheet containing a variable that uses the XPath collection() function. However, I get the following error when I run my code:
lxml.etree.XSLTApplyError: Failed to evaluate the expression of variable 'name'.
Here are the details of my files:
XML source : catalog.xml
<?xml version="1.0" encoding="UTF-8"?>
<collection>
<doc href="./IR_041698.xml"/>
<doc href="./IR_051379.xml"/>
</collection>
XSL file : test.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:tei="http://www.tei-c.org/ns/1.0"
exclude-result-prefixes="tei">
<xsl:strip-space elements="*"/>
<xsl:output indent="yes" method="xml"/>
<xsl:template match="/">
<xsl:variable name="name" select="collection('catalog.xml')/descendant::archdesc/did/origination/persname/text()"/>
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
<fileDesc>
<titleStmt>
<title>
<xsl:value-of select="$name"/>
</title>
</titleStmt>
</fileDesc>
</teiHeader>
</xsl:template>
</xsl:stylesheet>
Python code :
from lxml import etree as ET
source = ET.parse("catalog.xml")
xslt = ET.parse("test.xsl")
transform = ET.XSLT(xslt)
newdom = transform(source)
print(ET.tostring(newdom, pretty_print=True))
I am a little surprised, because the transformation works when I run it in the Oxygen XML editor, but not in Python.
Do you have any suggestions? Is the XPath collection() function a problem with lxml?
Thank you in advance.
The collection function is part of XPath 2 and XSLT 2 and later, and as such is not supported by lxml, which implements XSLT 1.0 through libxslt. You can, however, use the document function in XSLT, as document(document('catalog.xml')/*/doc/@href), to select the "collection" of documents referenced by the href attributes of the doc elements in catalog.xml.
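A small end-to-end sketch of the document() approach follows. It builds a throwaway catalog in a temporary directory; the file names a.xml and b.xml and their contents are made up for illustration, and text output is used only to keep the result easy to inspect.

```python
import os
import tempfile

from lxml import etree

STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Load every document listed in catalog.xml, then pick the names -->
    <xsl:for-each select="document(document('catalog.xml')/*/doc/@href)//persname">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

with tempfile.TemporaryDirectory() as tmp:

    def write(name, text):
        """Write a small file into the temporary directory and return its path."""
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return path

    # Hypothetical documents standing in for IR_041698.xml and IR_051379.xml
    write("a.xml", "<archdesc><did><origination><persname>Alice</persname></origination></did></archdesc>")
    write("b.xml", "<archdesc><did><origination><persname>Bob</persname></origination></did></archdesc>")
    catalog = write("catalog.xml", '<collection><doc href="a.xml"/><doc href="b.xml"/></collection>')
    xsl = write("test.xsl", STYLESHEET)

    # Parsing from files gives the stylesheet a base URI, so the relative
    # 'catalog.xml' reference can be resolved next to test.xsl
    transform = etree.XSLT(etree.parse(xsl))
    names = sorted(str(transform(etree.parse(catalog))).splitlines())

print(names)
```

The names are sorted because the relative order of nodes coming from different documents is implementation-dependent.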
Saxon 9.9 is also available as a Python module https://www.saxonica.com/saxon-c/doc/html/saxonc.html as part of Saxon C 1.2.1 (Download http://saxonica.com/download/c.xml, documentation: http://www.saxonica.com/saxon-c/documentation/index.html) so you might consider switching from lxml to Saxon-C if you want to use XSLT 3 in Python.

Parsing an xml file using lxml

I'm trying to edit an xml file by finding each Watts tag and changing the text in it. So far I've managed to change all tags, but not the Watts tag specifically.
My parser is:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "Watts":
        watt.text = "strong"
tree.write("output.xml")
This keeps my cycling.xml file unchanged. A snippet from output.xml (which is also the cycling.xml file since this is unchanged) is:
<TrainingCenterDatabase xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
<Activities>
<Activity Sport="Biking">
<Id>2018-05-06T20:49:56Z</Id>
<Lap StartTime="2018-05-06T20:49:56Z">
<TotalTimeSeconds>2495.363</TotalTimeSeconds>
<DistanceMeters>15345</DistanceMeters>
<MaximumSpeed>18.4</MaximumSpeed>
<Calories>0</Calories>
<Intensity>Active</Intensity>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2018-05-06T20:49:56Z</Time>
<Position>
<LatitudeDegrees>49.319297</LatitudeDegrees>
<LongitudeDegrees>-123.024128</LongitudeDegrees>
</Position>
<HeartRateBpm>
<Value>99</Value>
</HeartRateBpm>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>0</Watts>
<Speed>2</Speed>
</TPX>
</Extensions>
</Trackpoint>
If I change my parser to change all tags with:
for watt in root.iter():
    if watt.tag != "Watts":
        watt.text = "strong"
Then my output.xml file becomes:
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">strong<Activities>strong<Activity Sport="Biking">strong<Id>strong</Id>
<Lap StartTime="2018-05-06T20:49:56Z">strong<TotalTimeSeconds>strong</TotalTimeSeconds>
<DistanceMeters>strong</DistanceMeters>
<MaximumSpeed>strong</MaximumSpeed>
<Calories>strong</Calories>
<Intensity>strong</Intensity>
<TriggerMethod>strong</TriggerMethod>
<Track>strong<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<AltitudeMeters>strong</AltitudeMeters>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
How can I change just the Watts tag?
I don't understand what the root = tree.getroot() does. I just thought I'd ask this question at the same time, although I'm not sure it matters in my particular problem.
Your document defines a default XML namespace. Look at the xmlns= attribute at the end of the opening tag:
<TrainingCenterDatabase
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
This means there is no element named simply "Watts" in your document; you need to qualify tag names with the appropriate namespace. If you print out the value of watt.tag in your loop, you will see:
$ python filter.py
{http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}TrainingCenterDatabase
[...]
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Speed
With this in mind, you can modify your filter so that it looks like this:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts":
        watt.text = "strong"
tree.write("output.xml")
You can read more about namespace handling in the lxml documentation.
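As a variation on the loop above, iter() also accepts a qualified tag name, so the filtering can be left to the library. This sketch uses the standard library's xml.etree.ElementTree, whose iter() and {uri}tag notation lxml shares; the shortened sample document and the TPX_NS constant are introduced here only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the TPX extension block that contains <Watts>
TPX_NS = "http://www.garmin.com/xmlschemas/ActivityExtension/v2"

SAMPLE = """<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Trackpoint>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>0</Watts>
        <Speed>2</Speed>
      </TPX>
    </Extensions>
  </Trackpoint>
</TrainingCenterDatabase>"""

root = ET.fromstring(SAMPLE)

# Visit only elements whose qualified name is {TPX_NS}Watts
for watt in root.iter("{%s}Watts" % TPX_NS):
    watt.text = "strong"

watts = [el.text for el in root.iter("{%s}Watts" % TPX_NS)]
speeds = [el.text for el in root.iter("{%s}Speed" % TPX_NS)]
print(watts, speeds)
```

Only the Watts elements are touched; the sibling Speed element keeps its original text.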
Alternatively, since your goal is to edit XML and you are already using lxml, consider XSLT (the XML transformation language), where you can define a namespace prefix and change Watts anywhere in the document without looping. You can also pass values into XSLT from Python.
XSLT (save as .xsl file)
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- VALUE TO BE PASSED INTO FROM PYTHON -->
<xsl:param name="python_value"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- ADJUST WATTS TEXT -->
<xsl:template match="doc:Watts">
<xsl:copy><xsl:value-of select="$python_value"/></xsl:copy>
</xsl:template>
</xsl:transform>
Python
from lxml import etree
# LOAD XML AND XSL
doc = etree.parse("cycling.xml")
xsl = etree.parse('XSLT_Script.xsl')
# CONFIGURE TRANSFORMER
transform = etree.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = etree.XSLT.strparam('Strong')
result = transform(doc, python_value=n)
# PRINT TO CONSOLE
print(result)
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Pull out sections from XML in Python

Please note that I have some Python experience but not a lot of deep experience so please bear with me.
I have a very large XML file, ~100 megs, that has many, many sections and subsections. I need to pull out each subsection of a certain type (and there are a lot with this type) and write each to a different file. The writing I can handle, but I'm staring at ElementTree documentation trying to make sense of how to traverse the tree, find an element declared this way, yank out just the data between those tags and process it, then continue down the file.
The structure is similar to this (slightly obfuscated). What I want to do is pull out each section labeled "data" individually.
<filename>
<config>
<collections>
<datas>
<data>
...
</data>
<data>
...
</data>
<data>
...
</data>
</datas>
</collections>
</config>
</filename>
I think you can read in each data element using iterparse and then write it out. The following simply prints each element using the print function, but you could of course write it to a file instead:
import xml.etree.ElementTree as ET
for event, elem in ET.iterparse("input.xml"):
    if elem.tag == 'data':
        print(ET.tostring(elem, 'UTF-8', 'xml'))
        elem.clear()
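To write each section to its own file instead of printing it, the loop can be wrapped as below. This is a sketch under assumptions: the function name split_data_sections and the data_N.xml naming scheme are invented for illustration, and the tiny input file is generated in a temporary directory just to make the example runnable.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def split_data_sections(source_path, out_dir):
    """Stream through source_path and write every <data> element to its own file."""
    written = []
    # "end" events fire once an element and all of its children are complete
    for event, elem in ET.iterparse(source_path, events=("end",)):
        if elem.tag == "data":
            out_path = os.path.join(out_dir, "data_%d.xml" % (len(written) + 1))
            ET.ElementTree(elem).write(out_path, encoding="utf-8", xml_declaration=True)
            written.append(out_path)
            # Release the children that were just written to limit memory use
            elem.clear()
    return written


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.xml")
    with open(src, "w", encoding="utf-8") as f:
        f.write("<filename><config><collections><datas>"
                "<data><v>1</v></data><data><v>2</v></data>"
                "</datas></collections></config></filename>")
    paths = split_data_sections(src, tmp)
    names = [os.path.basename(p) for p in paths]
    first_value = ET.parse(paths[0]).getroot().findtext("v")

print(names, first_value)
```

Because the file is streamed, the whole 100 MB document never has to be held as a fully populated tree, although the emptied data elements do remain attached to their parent until parsing finishes.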
Consider an XSLT solution with Python's third-party module, lxml. Specifically, you use xpath() to count the <data> nodes and then iteratively build a dynamic XSLT script that selects only the needed element by its position index, writing each one to an individual XML file:
import lxml.etree as et
dom = et.parse('Input.xml')
datalen = len(dom.xpath("//data"))
for i in range(1, datalen+1):
    # Build a stylesheet that copies only the i-th <data> element
    xsltstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
    <xsl:strip-space elements="*"/>
    <xsl:template match="datas">
        <xsl:apply-templates select="data[{0}]" />
    </xsl:template>
    <xsl:template match="data[{0}]">
        <xsl:copy>
            <xsl:copy-of select="*"/>
        </xsl:copy>
    </xsl:template>
    </xsl:transform>'''.format(i)
    xslt = et.fromstring(xsltstr)
    transform = et.XSLT(xslt)
    newdom = transform(dom)
    tree_out = et.tostring(newdom, encoding='UTF-8', pretty_print=True,
                           xml_declaration=True)
    # Write each section to its own numbered file
    with open('Data{}.xml'.format(i), 'wb') as xmlfile:
        xmlfile.write(tree_out)

Modify xml using python

I have an xml file already generated by python and it looks like this. It has multiple items.
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>title-name-xyz</title>
<link>http://dist.stage.xyzgauri.com/qa/partner/mac.xml</link>
<description>Most recent changes</description>
<language>en</language>
<item>
<title>Version 3.0.22.4</title>
<sparkle:releaseNotesLink>
https://dist.stage.xyzgauri.com.com/qa/partner/mac_notes.html
</sparkle:releaseNotesLink>
<pubDate>Thu, 12 Nov 2015 04:38:23 -0000</pubDate>
<enclosure
url="https://dist.stage.xyzgauri.com/qa/sandisk/InstallCloud.3.0.22.4.pkg"
sparkle:version="3.0.22.4"
sparkle:shortVersionString="3.0.22"
openlength="30455215"
type="application/octet-stream"
sparkle:dsaSignature="MCwCFHvf7peesvwR0AhRbZxTViLarxcjfd758mHPbnOW6wA=="
sparkle:status="live"
/>
<item>
<title>Version 3.0.10.4</title>
<sparkle:releaseNotesLink>
http://dist.stage.xyzgauri.com/qa/partner/mac_notes.html
</sparkle:releaseNotesLink>
<pubDate>Tue, 03 Nov 2015 04:31:18 -0000</pubDate>
<enclosure
url="http://dist.stage.xyzgauri.com/qa/partner/InstallCloud.3.0.10.4.pkg"
sparkle:version="3.0.10.4"
sparkle:shortVersionString="3.0.10"
openlength="29709636"
type="application/octet-stream"
sparkle:dsaSignature="MCwCFDPvLPr7lYkrx5L5XCDbhXYqrFkGzLtLePK6ng=="
sparkle:status="live"
/>
I need to use python to change the sparkle:status from "live" to "expired" for the older version 3.0.10.4. This xml is later pushed to S3.
I am a newbie to Python and hence wondering how to implement this. I can even create a whole new Jenkins job to get this XML, modify it, and then push it to S3.
Any help is appreciated.
Thanks.
Consider an XSLT solution using the lxml package, which avoids looping through all elements as an XPath solution may require. The script here runs an identity transform to copy all nodes and attributes as is, and then applies a template specifically to every @sparkle:status attribute whose enclosure element also has @sparkle:version='3.0.10.4'. Note that the sparkle namespace has to be declared in the XSLT header.
The code below loads the XSLT script from a string, but you can also parse it from an external file (saved with an .xsl or .xslt extension), as you do with your XML file.
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('Input.xml')
xslstr='''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="enclosure[@sparkle:version='3.0.10.4']/@sparkle:status">
<xsl:attribute name="sparkle:status">expired</xsl:attribute>
</xsl:template>
</xsl:transform>'''
xslt = ET.fromstring(xslstr)
# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)
# SAVE OUTPUT
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out.decode("utf-8"))
xmlfile = open('Output.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
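For comparison, the looping alternative mentioned above is also short. The sketch below uses the standard library's xml.etree.ElementTree rather than XSLT; the function name expire_version and the cut-down feed are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

SPARKLE = "http://www.andymatuschak.org/xml-namespaces/sparkle"

# Cut-down feed with only the attributes that matter here
FEED = """<rss version="2.0" xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle">
  <channel>
    <item><enclosure sparkle:version="3.0.22.4" sparkle:status="live"/></item>
    <item><enclosure sparkle:version="3.0.10.4" sparkle:status="live"/></item>
  </channel>
</rss>"""


def expire_version(root, version):
    """Set sparkle:status to 'expired' on enclosures with the given sparkle:version."""
    changed = 0
    for enclosure in root.iter("enclosure"):
        # Namespaced attributes are addressed as {namespace-uri}localname
        if enclosure.get("{%s}version" % SPARKLE) == version:
            enclosure.set("{%s}status" % SPARKLE, "expired")
            changed += 1
    return changed


root = ET.fromstring(FEED)
changed = expire_version(root, "3.0.10.4")
statuses = {
    e.get("{%s}version" % SPARKLE): e.get("{%s}status" % SPARKLE)
    for e in root.iter("enclosure")
}
print(changed, statuses)
```

When serializing with the standard library, calling ET.register_namespace("sparkle", SPARKLE) beforehand keeps the sparkle prefix in the output instead of a generated one.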

lxml (or lxml.html): print tree structure

I'd like to print out the tree structure of an etree (formed from an HTML document) in a way that distinguishes trees, meaning that two different etrees should print out differently.
What I mean by structure is the "shape" of the tree, which basically means all the tags but no attribute and no text content.
Any idea? Is there something in lxml to do that?
If not, I guess I have to iterate through the whole tree and construct a string from that. Any idea how to represent the tree in a compact way? (the "compact" feature is less relevant)
FYI it is not intended to be looked at, but to be stored and hashed to be able to make differences between several html templates.
Thanks
One option is to run some XSLT over the source XML to strip everything but the tags; it is then easy to use etree.tostring to get a string you can hash:
from lxml import etree as ET
def pp(e):
    print(ET.tostring(e, pretty_print=True).decode())
    print()
root = ET.XML("""\
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8" />
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
<boolean id="import_live">0</boolean>
</preference-set>
</project>
""")
pp(root)
xslt = ET.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
""")
tr = ET.XSLT(xslt)
doc2 = tr(root)
root2 = doc2.getroot()
pp(root2)
Gives you the output:
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8"/>
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
<boolean id="import_live">0</boolean>
</preference-set>
</project>
<project>
<livefolder/>
<livefolder/>
<preference-set>
<boolean/>
</preference-set>
</project>
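If the goal is only a compact, hashable fingerprint, the tree can also be walked directly without XSLT. The following is a rough sketch: the tag(child,child) notation and the names structure_signature and structure_hash are choices made here for illustration, and it runs on the standard library's ElementTree, though the same traversal should apply to lxml elements.

```python
import hashlib
import xml.etree.ElementTree as ET


def structure_signature(elem):
    """Return a compact string of tag names only, e.g. 'a(b(),c(d()))'."""
    # Skip comments and processing instructions, whose .tag is not a string
    children = [structure_signature(c) for c in elem if isinstance(c.tag, str)]
    return "%s(%s)" % (elem.tag, ",".join(children))


def structure_hash(elem):
    """Hash the signature so templates can be compared or stored cheaply."""
    return hashlib.sha256(structure_signature(elem).encode("utf-8")).hexdigest()


a = ET.fromstring('<project id="1"><livefolder>Mooo</livefolder>'
                  '<preference-set><boolean>0</boolean></preference-set></project>')
b = ET.fromstring('<project><livefolder/><preference-set><boolean/></preference-set></project>')
c = ET.fromstring('<project><livefolder/><preference-set/></project>')

sig_a = structure_signature(a)
print(sig_a)
print(structure_hash(a) == structure_hash(b), structure_hash(a) == structure_hash(c))
```

Attributes and text are ignored, so a and b share a hash, while c differs because its shape differs.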
