Good XSLT for Python - lxml struggles

I'm trying to transform XHTML to text using a user-defined XSLT, which is the following:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xpath-default-namespace="http://www.w3.org/1999/xhtml">
<xsl:output method="text"/>
<xsl:template match="/html">
Reading document entitled <xsl:value-of select="head/title"/>.
The top menu for this site has the following options:
<xsl:for-each select="body//ul[@role='menubar']/li/a">
<xsl:value-of select="."></xsl:value-of> <xsl:text>
</xsl:text>
</xsl:for-each>
Now let's read the main part of the page.
<xsl:for-each select="body//main[@class='container']//(h1 | h2 | h3 | h4 | p | ul/li/a)">
<xsl:value-of select="normalize-space(.)"/><xsl:text>
</xsl:text><xsl:text>
</xsl:text>
</xsl:for-each>
The footer menu for this site has the following options:
<xsl:for-each select="body//footer[@id='wb-info']//ul/li/a">
<xsl:value-of select="."></xsl:value-of> <xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
When I test it at http://xsltransform.net/ against a typical HTML page, the output is as expected.
I then test the same XSLT against the same XHTML using the following Python code:
import lxml.etree as ET
html = ET.parse("../fixed_html/about.html")
xslt = ET.parse("../templates/generic.xslt")
transform = ET.XSLT(xslt)
res = transform(html)
print(res)
I get the following error:
lxml.etree.XSLTParseError: xsl:for-each : could not compile select expression 'body//main[@class='container']//(h1 | h2 | h3 | h4 | p | ul/li/a)'
My first thought is that lxml has limitations and can't handle valid XSLT. I'm hoping that's not the case and that I just failed to set up the code correctly.
Are there any issues with the Python code? Can I process the XSLT above in Python some other way?

Your stylesheet declares version="1.0", but the code itself requires an XSLT 2.0 processor:
The xpath-default-namespace attribute is an XSLT 2.0 feature;
In XPath 1.0, a parenthesized expression is allowed only as the first step of a location path, so a step such as //(h1 | h2 | ...) cannot be compiled.
lxml uses the libxslt processor, which only supports XSLT 1.0. You will need to rewrite your stylesheet for XSLT 1.0 or find a way to incorporate an XSLT 2.0 or higher processor in your processing chain.
"When I test it at http://xsltransform.net/ against a typical HTML page, the output is as expected."
That is true only when you select the Saxon 9.5.1 engine. With any other processor you will get an error.
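As a rough illustration of the first option (rewriting for XSLT 1.0), the sketch below binds the XHTML namespace to a prefix instead of relying on xpath-default-namespace, and replaces the parenthesized step with a predicate on the self:: axis. The prefix x, the shortened template, and the sample XHTML are assumptions made for this example, not the asker's actual files.

```python
from lxml import etree

# XSLT 1.0 version of the "main part" loop. The prefix "x" is an arbitrary
# choice for this sketch; it stands in for xpath-default-namespace (XSLT 2.0).
STYLESHEET = """\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <xsl:template match="/x:html">
    <xsl:text>Title: </xsl:text>
    <xsl:value-of select="x:head/x:title"/>
    <xsl:text>&#10;</xsl:text>
    <!-- XPath 1.0 replacement for //(h1 | h2 | h3 | h4 | p | ul/li/a) -->
    <xsl:for-each select="x:body//x:main[@class='container']//*[
        self::x:h1 or self::x:h2 or self::x:h3 or self::x:h4 or self::x:p
        or self::x:a[parent::x:li/parent::x:ul]]">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical input document, used only to exercise the stylesheet.
SAMPLE_XHTML = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>About</title></head>
  <body>
    <main class="container">
      <h1>Welcome</h1>
      <p>First   paragraph.</p>
      <ul><li><a href="#top">Back to top</a></li></ul>
    </main>
  </body>
</html>
"""

# Compile the stylesheet with libxslt (via lxml) and apply it.
transform = etree.XSLT(etree.XML(STYLESHEET))
result = str(transform(etree.XML(SAMPLE_XHTML)))
print(result)
```

The same pattern (a prefix on every element name, a predicate instead of a parenthesized step) would apply to the menubar and footer loops in the original stylesheet.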

XSLT 2 and 3 are supported in Python by Saxonica's SaxonC 11.1 release, done this month; see details at https://www.saxonica.com/download/c.xml and https://www.saxonica.com/saxon-c/documentation11/index.html#!starting.
At the current stage, you need to compile/build the Python module yourself after downloading the source code and the library modules of SaxonC 11.1.

Related

How to edit pmml model file using xml parser

I want to remove some nodes from a pmml file that I generated. So I tried to use xml parser in python:
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse('treedemo.pmml')
for inter in tree.findall('DataDictionary'):
    print(inter)
The loop prints nothing, which I take to mean the XML parser didn't work. The PMML file is here. Suppose I want to delete
<Interval closure="closedClosed" leftMargin="21.0" rightMargin="46.0"/>
from
<DataField name="fk_057_nearcontact_auth_expire_time" optype="continuous" dataType="float">
<Interval closure="closedClosed" leftMargin="21.0" rightMargin="46.0"/>
</DataField>
Can a PMML file be edited and modified with Python?

Rather than developing custom XML manipulation code, you should learn about an existing technology called XSL Transformations (XSLT).
In brief, you need to create an XSL document, which specifies XML manipulation actions. You can then apply this XSL document to one or more XML documents (including PMML documents) using command-line XSLT tools. For example, on GNU/Linux systems you can use the xsltproc tool.
An XSL document for deleting Interval elements:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:pmml="http://www.dmg.org/PMML-4_2">
<!-- By default, copy all -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- However, in case of the PMML Interval element, take no (copy-) action -->
<xsl:template match="pmml:Interval"/>
</xsl:stylesheet>
Be sure to configure the value of the pmml namespace prefix to match that of your PMML documents. The above example applies to PMML schema version 4.2 documents.
Then, apply the stylesheet to PMML files (command syntax xsltproc <XSL file> <PMML file(s)>):
$ xsltproc --output test-mod.pmml test.xsl test.pmml

In general, editing a PMML file by hand is risky, so be careful.
You can use BeautifulSoup.
For your specific goal, the tag 'Interval' appears only once, so you can find it in a single step and then extract it:
# import BeautifulSoup
from bs4 import BeautifulSoup
# open and read the file
inf = open(r'treedemo.xml', 'r')
txt = inf.read()
inf.close()
# prepare the soup
soup = BeautifulSoup(txt, 'xml')
# now find the tag you want to remove, in this case it's easy, since the tag 'Interval' is unique across your pmml file:
interval = soup.find('Interval')
# remove the tag
interval.extract()
# write the updated pmml file
with open(r'treedemo_clean.xml', "w") as outf:
    outf.write(str(soup))
The output will have no indentation unless you use outf.write(str(soup.prettify())).
I do not recommend using prettify; it might mess up the PMML.
If the tag is not unique, you have to locate it carefully to avoid deleting the wrong tag and breaking your PMML.
There is nothing wrong with the field you want to remove; it shows statistics from your training data set. You can set the flag with_statistics=False
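For the case where the tag is not unique, one way to be selective is to match on the name of the parent DataField. The sketch below uses the standard-library ElementTree and assumes the PMML 4.2 namespace (http://www.dmg.org/PMML-4_2); the helper name remove_intervals and the sample document are made up for illustration. Element names have to be qualified with the namespace, which is a likely reason an unqualified findall('DataDictionary') matches nothing.

```python
import xml.etree.ElementTree as ET

# Assumption: the document uses the PMML 4.2 namespace. Adjust to your file.
PMML_NS = "http://www.dmg.org/PMML-4_2"


def remove_intervals(pmml_text, field_name):
    """Remove Interval children of the DataField called field_name.

    Returns the serialized document and the number of removed elements.
    """
    # Serialize PMML_NS as the default namespace instead of ns0 (global setting).
    ET.register_namespace("", PMML_NS)
    root = ET.fromstring(pmml_text)
    removed = 0
    for field in root.iter("{%s}DataField" % PMML_NS):
        if field.get("name") != field_name:
            continue
        # findall returns a list, so removing while looping over it is safe.
        for interval in field.findall("{%s}Interval" % PMML_NS):
            field.remove(interval)
            removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Hypothetical two-field document used to exercise the helper.
SAMPLE = """\
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <DataDictionary>
    <DataField name="field_a" optype="continuous" dataType="float">
      <Interval closure="closedClosed" leftMargin="21.0" rightMargin="46.0"/>
    </DataField>
    <DataField name="field_b" optype="continuous" dataType="float">
      <Interval closure="closedClosed" leftMargin="0.0" rightMargin="1.0"/>
    </DataField>
  </DataDictionary>
</PMML>
"""

cleaned, removed = remove_intervals(SAMPLE, "field_a")
print(removed)
print(cleaned)
```

Only the Interval under field_a is dropped; the one under field_b is left in place.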

Pretty print subnode without namespace declaration

I have an xml document and I want to extract a subnode (boundedBy) and pretty_print it exactly as it looks in the original document (with exception to the pretty formatting).
<?xml version="1.0" encoding="UTF-8" ?>
<wfs:FeatureCollection
xmlns:sei="https://somedomain.com/namespace"
xmlns:wfs="http://www.opengis.net/wfs"
xmlns:gml="http://www.opengis.net/gml"
xmlns:ogc="http://www.opengis.net/ogc"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd
https://somedomain.com/schemas/wfsnamespace some.xsd">
<gml:boundedBy>
<gml:Box srsName="EPSG:4326">
<gml:coordinates>-10.934396,-139.997120 77.396455,-53.627763</gml:coordinates>
</gml:Box>
</gml:boundedBy>
<gml:featureMember>
<sei:HUB_HEIGHT_FCST>
<!--- This is the section I want --->
<gml:boundedBy>
<gml:Box srsName="EPSG:4326">
<gml:coordinates>14.574435,-139.997120 14.574435,-139.997120</gml:coordinates>
</gml:Box>
</gml:boundedBy>
<!--- This is the section I want --->
<sei:geometry_4326>
<gml:Point srsName="EPSG:4326">
<gml:coordinates>14.574435,-139.997120</gml:coordinates>
</gml:Point>
</sei:geometry_4326>
<sei:rundatetime>2017-09-26 00:00:00</sei:rundatetime>
<sei:validdatetime>2017-09-26 17:00:00</sei:validdatetime>
</sei:HUB_HEIGHT_FCST>
</gml:featureMember>
</wfs:FeatureCollection>
Here is how I'm extracting the subnode:
from lxml import etree
# parse the xml string
parser = etree.XMLParser(remove_blank_text=True, remove_comments=True, recover=False, strip_cdata=False)
root = etree.fromstring(xmlstr, parser=parser)
# find the subnode I want
subnodes = root.xpath("./gml:boundedBy", namespaces={'gml': 'http://www.opengis.net/gml'})
subnode = subnodes[0]
# make a pretty output
xmlstr = etree.tostring(subnode, xml_declaration=False, encoding="UTF-8", pretty_print=True)
print(xmlstr.decode("UTF-8"))
Which gives me this. Unfortunately lxml is adding the namespaces to the boundedBy node (which makes sense for the sake of completeness in xml).
<gml:boundedBy xmlns:gml="http://www.opengis.net/gml" xmlns:sei="https://somedomain.com/namespace" xmlns:wfs="http://www.opengis.net/wfs" xmlns:ogc="http://www.opengis.net/ogc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<gml:Box srsName="EPSG:4326">
<gml:coordinates>-10.934396,-139.997120 77.396455,-53.627763</gml:coordinates>
</gml:Box>
</gml:boundedBy>
I only want the subnode as it looked in the original document.
<gml:boundedBy>
<gml:Box srsName="EPSG:4326">
<gml:coordinates>14.574435,-139.997120 14.574435,-139.997120</gml:coordinates>
</gml:Box>
</gml:boundedBy>
I'm flexible with not using lxml, but either way I haven't found options on how to accomplish this.
edit:
Since it was pointed out that I should explain why I want to do this...
I'm trying to log the XML fragment without altering its original structure. The automated test I'm building looks at certain nodes for correctness. In the process I'm logging the fragment and want to make it a bit more readable for the person reviewing. Some of the fragments can get fairly large, which is why pretty_print is so nice.

Python script to change an attribute value in .tcx file (XML)

I have a .tcx (XML) file, with the following schema:
<Activities>
<Activity>
<Lap StartTime="2015-12-24T08:12:18.969Z">
<TotalTimeSeconds>4069.0</TotalTimeSeconds>
<DistanceMeters>30458.794921875</DistanceMeters>
<MaximumSpeed>43.36123275756836</MaximumSpeed>
<Calories>2286</Calories>
<AverageHeartRateBpm><Value>144</Value></AverageHeartRateBpm><MaximumHeartRateBpm><Value>169</Value></MaximumHeartRateBpm>
<Intensity>Active</Intensity>
<Cadence>87</Cadence>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2015-12-24T08:12:19.969Z</Time>
<Position><LatitudeDegrees>45.4917</LatitudeDegrees><LongitudeDegrees>9.16198</LongitudeDegrees></Position>
<AltitudeMeters>124.018</AltitudeMeters>
<DistanceMeters>0.0</DistanceMeters>
<SensorState>Present</SensorState>
<Extensions><TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2"><Watts>0</Watts></TPX></Extensions></Trackpoint>
...
</Track>
</Lap>
</Activity>
</Activities>
and need to change (double) the Watts attribute.
Would like a simple python script

Simply run an XSLT script. No Python loops or expensive XPaths (//) are needed. For background, XSLT is a declarative, special-purpose programming language used specifically to restructure, redesign, or re-format XML documents for various end uses. Like most general-purpose languages such as Java, C#, Perl, PHP, and VB, Python has access to an XSLT 1.0 processor, here through the third-party lxml module.
The stylesheet below runs an identity transform to copy the entire document as is and then multiplies the current value in any Watts node by 2. I declare a namespace prefix doc in the XSLT to reference the Watts element.
XSLT (save as .xsl or .xslt)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="doc:Watts">
<xsl:copy>
<xsl:value-of select=". * 2"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
Python Script
import lxml.etree as ET
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open('Output.xml', 'wb')
xmlfile.write(tree_out)
xmlfile.close()

Your last two element tags need to be closing tags, and you have a Watts element not an attribute. Here is how to do it with your file structure.
Python provides the ElementTree library for this. The following script will accomplish what you want:
import xml.etree.ElementTree as ET
tree = ET.parse("test.tcx")
tpxns = "http://www.garmin.com/xmlschemas/ActivityExtension/v2"
for watts in tree.iter("{%s}Watts"%tpxns):
    watts.text = str(2*int(watts.text))
tree.write("testnew.tcx")
Here I import the ElementTree library and use a simpler name for it. The parse function creates an ElementTree object from your file. I walk through the file to find all Watts elements (as these occur in a namespace, I actually need to look for {http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts, which I build up using string formatting).
When I find such an element, I set the inner text to be twice what the previous value was (converting to an int first and then back to a string).
Finally, I write the new XML file out. I could have overwritten the original file here if I wanted to.
Look over the documentation for the ElementTree module if you need to do anything fancier. It provides very powerful tools for working with XML. There are even more powerful libraries out there if you need more features (I like lxml for example).

Parse Two XML using Python

I have an XML file that contains 200 Event blocks looking like below:
<?xml version='1.0' encoding='UTF-8'?>
<ProjectData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.project.com/proj1/projv" xsi:schemaLocation="http://www.pp.com/oj/p http://www.onj.com/p/IXX/schema/proj.xsd">
<fileType>This file is sample</fileType>
<header>
<fileID>none</fileID>
<version>1.0</version>
<modified>2015-09-16T17:03:25</modified>
</header>
<EventList>
<Event>
<Id>0</Id>
<pp define="something">2</pp>
<Index>3</Index>
<Conf ref="point">CFG.AC.UF</Conf>
<Check>tttt</Check>
<Group>wwll</Group>
<Heart ref="point">mbmb</Heart>
<Name>kkk</Name>
<Thresh ref="point">kckcv</Thresh>
<Hyster ref="point">foo</Hyster>
<Trip ref="point">dim</Trip>
<Clear ref="point">CLR.AC.UF</Clear>
</Event>
</EventList>
</ProjectData>
The Event block contains the information I want to extract (only four elements: Id, Index, Name, and Group) to generate my new XML file. I want to do this with Python code.
Does anyone know how I can achieve this in Python?
My new xml file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<Newevents>
<event>
<Id>0</Id>
<Index>3</Index>
<Name>kkk**$Id**</Name>
<Group>wwll**$Index**</Group>
<desc>placeholder</desc>
</event>
</Newevents>
I also want to append the Id and Index, which are numbers, to the Name and Group strings, zero-padded to three digits.
For example, if the Id is 1, I want my Name to be kkk001; if the Id is 3, I want my Name to be kkk003.
The same goes for my Group element string, but using Index: if Index is 5, I want my Group to be wwll005.
I Googled, but the information about this is scattered.
Can anyone come up with neat Python code that parses my XML file and generates the new XML file in the format and numbering described above?
I have another XML file called descXML.xml that I need to parse only to take the desc element string and add it to my new XML file.
In this second XML file (descXML.xml), the desc element data should be taken based on an Id match with my new XML file.
Is it possible to check whether the Id element equals the Id element in my new XML file and, if so, add the desc element content for the corresponding Id? How can I express this condition? Can you provide an example in Python?
Here is what the descXML.xml file looks like; analogous to my first XML file, it also contains 200 Event blocks:
<EventList>
<Event>
<Mnemonic>AC.UF.SLOW</Mnemonic>
<Id define="xyz">3</Id>
<Index>13</Index>
<Description>today was warm and I want to go swimming</Description>
</Event>
</EventList>
Can 1 and 2 above be merged into one Python file?
The final XML file I want should look like below:
<?xml version="1.0" encoding="UTF-8"?>
<Newevents>
<event>
<Id>3</Id>
<Index>13</Index>
<Name>kkk000</Name>
<Group>wwll003</Group>
<desc>today was warm and I want to go swimming</desc>
</event>
</Newevents>
Trial based on comments given below:
I tried to be concise here and tried the solution provided below, but it did not work, so I am providing my exact XML files:
My file1.xml
<?xml version='1.0' encoding='UTF-8'?>
<Dataizx xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.kklk.com/cx1/ASD" xsi:schemaLocation="http://www.kklk.com/cx1/ASD http://www.kklk.com/cx1/ASD/schema/tell.xsd">
<fileType>Auto-Generated IXX Events Metadata</fileType>
<header>
<fileID>none</fileID>
<version>1.0</version>
<modified>2015-09-16T17:03:25</modified>
</header>
<EventList>
<Event>
<Mnemonic>ijk</Mnemonic>
<Id define="rece">2</Id>
<Index>0</Index>
<Config ref="point">shine</Config>
</Event>
<Event>
<Mnem>xyz</Mnem>
<Id define="teller">3</Id>
<Index>1</Index>
<Config ref="point">good</Config>
</Event>
</EventList>
</Dataizx>
And here is my xml that contains the description:
<?xml version="1.0" encoding="UTF-8"?>
<IXXData xmlns="http://www.mnm.com/mnm/mnm" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:i="http://www.mnm.com/mnm/mnm" xsi:schemaLocation="http://www.mnm.com/mnm/mnm http://www.mnm.com/mnm/mnm/schema/mnm.xsd">
<fileType>Merged IXX Events Metadata</fileType>
<header>
<fileID>none</fileID>
<version>1.0 + none</version>
<description>Merged event metadata.</description>
</header>
<EventList>
<Event>
<Id define="mmm">2</Id>
<Description>everything was good.</Description>
</Event>
<Event>
<Id define="lll">4</Id>
<Description>teller and the other one.</Description>
</Event>
<Event>
<Id define="ggg">3</Id>
<Description>weather is nice.</Description>
</Event>
</EventList>
</IXXData>
I used your xsl and python but I could not get description out of the second file.

Consider an XSLT solution, which can pick various nodes from the original XML and merge in nodes from an external XML based on specific criteria. Python (like many general-purpose programming languages) has access to an XSLT processor, for example through the lxml module.
For background, XSLT is a special-purpose, declarative programming language (not an object-oriented one) for transforming XML files into various formats and structures.
For your purposes you can also use XSLT's document() and concat() functions. Your XSLT was a little involved, as it required setting a variable to match ids across documents and had quite a few namespaces to manage.
XSLT (save externally as .xsl file)
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:p="http://www.kklk.com/cx1/ASD"
xmlns:i="http://www.mnm.com/mnm/mnm"
xsi:schemaLocation="http://www.kklk.com/cx1/ASD http://www.kklk.com/cx1/ASD/schema/tell.xsd"
exclude-result-prefixes="xsi p i">
<xsl:output version="1.0" encoding="UTF-8"/>
<xsl:template match="p:EventList">
<NewsEvents>
<xsl:for-each select="p:Event">
<Id><xsl:value-of select="p:Id"/></Id>
<Index><xsl:value-of select="p:Index"/></Index>
<Name><xsl:value-of select="concat(p:Name, '00', p:Id)"/></Name>
<Group><xsl:value-of select="concat(p:Group, '00', p:Index)"/></Group>
<xsl:variable name="descID" select="p:Id"/>
<desc><xsl:value-of select="document('descXML.xml')/i:IXXData/i:EventList/
i:Event/i:Id[text()=$descID]/following-sibling::i:Description"/></desc>
</xsl:for-each>
</NewsEvents>
</xsl:template>
</xsl:transform>
Python (loads .xml and .xsl, transforming former with the latter for new .xml output)
#!/usr/bin/python
import lxml.etree as ET
dom = ET.parse('C:\\Path\\To\\MainXML.xml')
xslt = ET.parse('C:\\Path\\To\\AboveXSLT.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open('C:\\Path\\To\\Output.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
Output (using above posted XML data)
(if descXML's Id matches any Event's Id, corresponding <desc> node below will be populated)
<?xml version='1.0' encoding='UTF-8'?>
<Dataroot>
<NewsEvents>
<Id>2</Id>
<Index>0</Index>
<Name>002</Name>
<Group>000</Group>
<desc>everything was good.</desc>
</NewsEvents>
<NewsEvents>
<Id>3</Id>
<Index>1</Index>
<Name>003</Name>
<Group>001</Group>
<desc>weather is nice.</desc>
</NewsEvents>
</Dataroot>
I know this XSLT approach may look intimidating, but it saves much looping and creating of elements, subelements, and attributes in Python code. I often recommend this route whenever XML files are being handled, and I find it overlooked by programmers in general, not just Python users. Meanwhile, most of them readily work with another special-purpose, declarative language without question: SQL!
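For comparison, here is a minimal pure-Python sketch of the same merge using only the standard library. It matches elements by local name so that the differing namespaces of the two files do not matter, and it pads with "%03d" so that two-digit Ids come out as three digits (concat with a fixed '00' would not). The function names and the sample documents are hypothetical; the element names Id, Index, Name, Group, and Description follow the question's first listing.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def child_text(event, name, default=""):
    """Text of the first child whose local name is `name`."""
    for child in event:
        if local(child.tag) == name:
            return (child.text or "").strip()
    return default


def events(root):
    """All Event elements, whatever namespace they live in."""
    return [el for el in root.iter() if local(el.tag) == "Event"]


def merge_events(main_xml, desc_xml):
    # Map Id -> Description from the second document.
    descriptions = {
        child_text(ev, "Id"): child_text(ev, "Description")
        for ev in events(ET.fromstring(desc_xml))
    }
    out = ET.Element("Newevents")
    for ev in events(ET.fromstring(main_xml)):
        event_id = child_text(ev, "Id")
        index = child_text(ev, "Index")
        new = ET.SubElement(out, "event")
        ET.SubElement(new, "Id").text = event_id
        ET.SubElement(new, "Index").text = index
        # Zero-pad the numeric suffix to three digits.
        ET.SubElement(new, "Name").text = "%s%03d" % (child_text(ev, "Name"), int(event_id))
        ET.SubElement(new, "Group").text = "%s%03d" % (child_text(ev, "Group"), int(index))
        ET.SubElement(new, "desc").text = descriptions.get(event_id, "")
    return out


# Hypothetical inputs with made-up namespaces.
MAIN = """<Data xmlns="http://example.com/main"><EventList>
<Event><Id define="a">2</Id><Index>0</Index><Name>kkk</Name><Group>wwll</Group></Event>
<Event><Id define="b">13</Id><Index>5</Index><Name>kkk</Name><Group>wwll</Group></Event>
</EventList></Data>"""

DESC = """<Desc xmlns="http://example.com/desc"><EventList>
<Event><Id>13</Id><Description>weather is nice.</Description></Event>
<Event><Id>2</Id><Description>everything was good.</Description></Event>
</EventList></Desc>"""

merged = merge_events(MAIN, DESC)
print(ET.tostring(merged, encoding="unicode"))
```

Events whose Id has no counterpart in the description file get an empty desc element in this sketch.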

Python XML Generation: Avoid multiple ns0 tag without lxml

I have a Python script that simply reads "input.xml" and copies it into "output.xml". As shown in "output.xml", Python's ElementTree generates ns0 and ns1 prefixes. How can I avoid these prefixes without using other XML libraries (e.g. lxml)?
Script:
import xml.etree.ElementTree as ET
fileName = "input.xml"
tree = ET.parse(fileName)
tree.write("output.xml")
Input.xml:
<Car>
<brand xmlns = "www.car.com" xmlns:brand="www.bmw.com">
<arg key="name" value="series 3" />
</brand>
<market xmlns = "www.ebay.com">
<arg key="name" value="auto"/>
</market>
</Car>
output.xml:
<Car xmlns:ns0="www.car.com" xmlns:ns1="www.ebay.com">
<ns0:brand>
<ns0:arg key="name" value="series 3" />
</ns0:brand>
<ns1:market>
<ns1:arg key="name" value="auto" />
</ns1:market>
</Car>

I am afraid there is no simple solution to this.
There is a related issue in the Python bug tracker that has been open for a while.
You could try to follow the solution proposed there, but it does not look very clear.
My recommendation is to reconsider using lxml: it brings real power to XML processing, and Google App Engine includes it.
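One partial workaround that stays within the standard library is ET.register_namespace, which lets you choose the prefixes ElementTree writes. This is only a sketch and does not restore the original layout: the output still uses prefixes rather than the per-element default namespaces of the input, and the prefix names car and ebay are arbitrary choices made for this example.

```python
import xml.etree.ElementTree as ET

# Register readable prefixes; this is a process-wide setting in ElementTree.
ET.register_namespace("car", "www.car.com")
ET.register_namespace("ebay", "www.ebay.com")

# The input document from the question, inlined so the sketch is self-contained.
SOURCE = """\
<Car>
  <brand xmlns = "www.car.com" xmlns:brand="www.bmw.com">
    <arg key="name" value="series 3" />
  </brand>
  <market xmlns = "www.ebay.com">
    <arg key="name" value="auto"/>
  </market>
</Car>
"""

root = ET.fromstring(SOURCE)
# Serialization now uses car: and ebay: instead of ns0: and ns1:.
output = ET.tostring(root, encoding="unicode")
print(output)
```

The namespace URIs of every element are preserved, so the result is equivalent for namespace-aware consumers even though it is not textually identical to the input.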
