Sorting XML files - python

Is it possible to sort XML files like the following:
<model name="ford">
<driver>Bob</driver>
<driver>Alice</driver>
</model>
<model name="audi">
<driver>Carly</driver>
<driver>Dean</driver>
</model>
Which would become
<model name="audi">
<driver>Carly</driver>
<driver>Dean</driver>
</model>
<model name="ford">
<driver>Alice</driver>
<driver>Bob</driver>
</model>
That is, the outermost elements are sorted first, then the second outermost, and so on.
They'd need to be sorted by element name first.

This is a refinement of Kirill's solution. I think it better reflects the stated requirements, and it avoids the type error XSLT 2.0 gives you if the sort key contains more than one value (while still working on 1.0).
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" />
  <xsl:template match="*">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="*">
        <xsl:sort select="(@name | text())[1]"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

Try this XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" />
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()">
        <xsl:sort select="text() | @*"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

You can sort nodes by removing them from the parent node, and re-inserting them in the intended order. For example:
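from lxml import etree

def sort_tree(tree):
    """Recursively sort the given etree in place."""
    for child in tree:
        sort_tree(child)
    # Treat missing text as an empty string so the keys stay comparable.
    sorted_children = sorted(tree, key=lambda n: n.text or "")
    # Iterate over a copy: removing children while iterating over the
    # live element would skip some of them.
    for child in list(tree):
        tree.remove(child)
    for child in reversed(sorted_children):
        tree.insert(0, child)

tree = etree.fromstring(YOUR_XML)
sort_tree(tree)
print(etree.tostring(tree, pretty_print=True).decode())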
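Note that sorting on the text alone does not order the model elements from the question, because their text is only whitespace. The following is a minimal sketch of the same remove-and-reinsert idea using only the standard library's xml.etree.ElementTree, assuming a composite key of tag name, then the name attribute, then the text. The wrapping root element, the sort_key and sort_children names, and the choice of key are illustrative assumptions, not part of the answer above.

```python
import xml.etree.ElementTree as ET

def sort_key(elem):
    # Sort by tag name first, then the "name" attribute, then the text content.
    return (elem.tag, elem.get("name", ""), (elem.text or "").strip())

def sort_children(elem):
    """Recursively sort the children of elem in place."""
    for child in elem:
        sort_children(child)
    # Slice assignment replaces all children with the sorted sequence.
    elem[:] = sorted(elem, key=sort_key)

xml_text = """<root>
<model name="ford"><driver>Bob</driver><driver>Alice</driver></model>
<model name="audi"><driver>Carly</driver><driver>Dean</driver></model>
</root>"""

root = ET.fromstring(xml_text)
sort_children(root)
models = [m.get("name") for m in root]
drivers = [[d.text for d in m] for m in root]
print(models, drivers)
```

Slice assignment avoids the explicit remove and insert loops. Whitespace between elements travels with each element as its tail, so the output may need re-indenting.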

You don't need to sort the entire XML DOM.
Instead, take the required nodes into a list and sort them. Because we need the sorted order while processing and not in the file, it's better done at run time.
Maybe like this, using minidom.
from xml.dom import minidom

document = """\
<root>
<model name="ford">
<driver>Bob</driver>
<driver>Alice</driver>
</model><model name="audi">
<driver>Carly</driver>
<driver>Dean</driver>
</model>
</root>
"""
document = minidom.parseString(document)
# Copy the NodeList into a plain list, then sort by the value of the "name"
# attribute (the Attr objects themselves are not comparable).
elements = list(document.getElementsByTagName("model"))
elements.sort(key=lambda element: element.getAttribute("name"))

Related

How to evaluate XSLT processor parameters in a local context?

Let's say I have this source XML:
<A>
<B>something</B>
<B>something else</B>
</A>
and I want to transform it into this target XML:
<C>
<D>something</D>
<D>something else</D>
</C>
The obvious XSL of course is this:
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="A">
<C>
<xsl:for-each select="B">
<D><xsl:value-of select="."/></D>
</xsl:for-each>
</C>
</xsl:template>
</xsl:stylesheet>
Now let's say I don't know the paths I'm going to use beforehand and I want to parametrize them from my processor, which happens to be lxml (in Python).
So I change my XSL into this:
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:param name="path_of_B"/>
<xsl:template match="A">
<C>
<xsl:for-each select="$path_of_B">
<D><xsl:value-of select="."/></D>
</xsl:for-each>
</C>
</xsl:template>
</xsl:stylesheet>
and I call it from Python like this:
source = etree.parse("source.xml")
transform = etree.XSLT(etree.parse("transform.xsl"))
target = transform(source, path_of_B="B")
This doesn't give me the intended result, because when I pass the paths from the processor they are always evaluated in a global context: the current() node is always the root, no matter where I use the parameter. Is there any way to evaluate the XPaths in the correct context, as happens in the first example where I write them by hand?
I have tried many approaches, such as:
Passing parameters in nested templates, because I thought the evaluation would have the context of the template
Passing the parameters as strings and evaluating them later, but XPath 1.0 doesn't have an eval() function like Python
Attribute value templates, but they are not allowed on xsl elements
At some point I even touched <xsl:namespace-alias> to dynamically generate my XSL but it was very confusing.
So in the end, I solved it by pre-processing my xsl file with a template engine or string-formatting. It works, but I was just wondering if there is a "pure" XSLT+processor solution.
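For reference, the string-formatting workaround could look roughly like the sketch below. It only generates the stylesheet text and checks that it is well-formed, using the standard library. The __PATH_OF_B__ placeholder, the render_stylesheet helper and the sample XPath are hypothetical names chosen for illustration, not something from my actual setup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical placeholder token, chosen so it cannot clash with XSLT's own
# "$variable" and "{...}" syntax.
XSL_TEMPLATE = """<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="A">
<C>
<xsl:for-each select=__PATH_OF_B__>
<D><xsl:value-of select="."/></D>
</xsl:for-each>
</C>
</xsl:template>
</xsl:stylesheet>
"""

def render_stylesheet(path_of_b):
    """Return the stylesheet text with the XPath substituted in."""
    # quoteattr() adds the surrounding quotes and escapes &, < and quotes.
    return XSL_TEMPLATE.replace("__PATH_OF_B__", quoteattr(path_of_b))

xsl_text = render_stylesheet("B[. != 'x']")
# Parsing the result confirms that the generated stylesheet is well-formed XML.
stylesheet = ET.fromstring(xsl_text.encode("utf-8"))
for_each = stylesheet.find(".//{http://www.w3.org/1999/XSL/Transform}for-each")
print(for_each.get("select"))
```

The rendered text can then be parsed and handed to etree.XSLT in the usual way. Escaping the path with quoteattr() keeps quotes or angle brackets in the XPath from breaking the stylesheet.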
"XPath 1.0 doesn't have an eval() function"
No, but the libxslt processor supports the EXSLT dyn:evaluate() extension function. If you pass the path as a string (for example path_of_B=etree.XSLT.strparam("B")), you could do:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:dyn="http://exslt.org/dynamic"
extension-element-prefixes="dyn">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>
<xsl:param name="path_of_B"/>
<xsl:template match="/A">
<C>
<xsl:for-each select="dyn:evaluate($path_of_B)">
<D>
<xsl:value-of select="."/>
</D>
</xsl:for-each>
</C>
</xsl:template>
</xsl:stylesheet>
If you want to parametrize both your input and output element names, you could do something like this, although this method would not work well if your source XML's structure is not always the same.
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:param name="e1_input" select="'A'"/>
<xsl:param name="e1_output" select="'A_OUT'"/>
<xsl:param name="e2_input" select="'B'"/>
<xsl:param name="e2_output" select="'B_OUT'"/>
<xsl:template match="/">
<xsl:for-each select="*[name()=$e1_input]">
<xsl:element name="{$e1_output}">
<xsl:for-each select="*[name()=$e2_input]">
<xsl:element name="{$e2_output}">
<xsl:apply-templates/>
</xsl:element>
</xsl:for-each>
</xsl:element>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
See it working here : https://xsltfiddle.liberty-development.net/aKZkh9

Find text across multiple lines to replace it with python (xml)

I have the following XML file:
<?xml version='1.0' encoding='utf8'?>
<Products>
<item type="dict">
<id type="int">37475</id>
<name type="str">something_something</name>
<slug type="str">something_something</slug>
<permalink type="str">something_something</permalink>
<date_created type="str">date</date_created>
<date_created_gmt type="str">date</date_created_gmt>
<date_modified type="str">date</date_modified>
<date_modified_gmt type="str">date</date_modified_gmt>
<type type="str">simple</type>
<status type="str">publish</status>
<featured type="bool">False</featured>
<catalog_visibility type="str">visible</catalog_visibility>
<description type="str">something_something</description>
</item>
I started with a JSON file that I converted to an XML file, so all of the products in that file start with the <item type="dict"> tag, which is not what I want. I would like all of the products to be enclosed in a <product> tag.
To fix this issue I am doing the following:
tree = ET.ElementTree(root)
xmlstr = ET.tostring(root, encoding='utf8', method='xml') #xml of each product to string so that it can be edited
finalstr = xmlstr.decode("utf-8").replace(' />','') #remove wrong part
finalstr = finalstr.replace('<item type="dict"> <id type="int">','<product> <id type="int">')
This works for other problems in my XML file, but only when they are on one line.
My question is how do I select two or more lines so that I can replace them?
Desired output:
<?xml version='1.0' encoding='utf8'?>
<Products>
<product>
<id type="int">37475</id>
<name type="str">something_something</name>
<slug type="str">something_something</slug>
<permalink type="str">something_something</permalink>
<date_created type="str">date</date_created>
<date_created_gmt type="str">date</date_created_gmt>
<date_modified type="str">date</date_modified>
<date_modified_gmt type="str">date</date_modified_gmt>
<type type="str">simple</type>
<status type="str">publish</status>
<featured type="bool">False</featured>
<catalog_visibility type="str">visible</catalog_visibility>
<description type="str">something_something</description>
</product>
It should be possible with a regular expression:
finalstr = re.sub(r'<item type="dict">\s*<id type="int">', '<product>\n<id type="int">', finalstr)
This will allow you to select more than one line (notice the \s* part between the XML nodes: it matches any amount of whitespace, including newlines, in between).
Read more about re.sub here: https://docs.python.org/3/library/re.html
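As a small self-contained check of the idea, here is a sketch on a shortened sample. The sample text is an assumption, and the pattern uses \s*, which already covers newlines. The closing </item> tag sits on a single line, so a plain str.replace handles it.

```python
import re

xml_text = """<Products>
<item type="dict">
<id type="int">37475</id>
<name type="str">something_something</name>
</item>
</Products>"""

# \s matches spaces, tabs and newlines, so the pattern spans the line break
# between the two tags.
result = re.sub(r'<item type="dict">\s*<id type="int">',
                '<product>\n<id type="int">', xml_text)
# The closing tag is a separate, single-line replacement.
result = result.replace("</item>", "</product>")
print(result)
```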
Here is the XSLT for the scenario. It follows the so-called modified identity transform pattern.
If you need the XML prolog, just change omit-xml-declaration="yes" to omit-xml-declaration="no".
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>
  <!-- identity template -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="item">
    <product>
      <xsl:apply-templates />
    </product>
  </xsl:template>
</xsl:stylesheet>
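If running XSLT is not an option, roughly the same effect can be had by renaming the elements in the parsed tree before serializing it. This is a minimal sketch with the standard library's ElementTree on a shortened sample, which is an assumption. Like the item template above, it drops the type="dict" attribute.

```python
import xml.etree.ElementTree as ET

xml_text = """<Products>
<item type="dict">
<id type="int">37475</id>
<type type="str">simple</type>
</item>
</Products>"""

root = ET.fromstring(xml_text)
for item in root.iter("item"):
    item.tag = "product"   # rename the element
    item.attrib.clear()    # drop type="dict", as the XSLT template does
renamed = ET.tostring(root, encoding="unicode")
print(renamed)
```

Working on the tree rather than on the serialized string means line breaks between tags do not matter.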

Parsing an xml file using lxml

I'm trying to edit an xml file by finding each Watts tag and changing the text in it. So far I've managed to change all tags, but not the Watts tag specifically.
My parser is:
from lxml import etree

tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "Watts":
        watt.text = "strong"
tree.write("output.xml")
This keeps my cycling.xml file unchanged. A snippet from output.xml (which is also the cycling.xml file since this is unchanged) is:
<TrainingCenterDatabase xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
<Activities>
<Activity Sport="Biking">
<Id>2018-05-06T20:49:56Z</Id>
<Lap StartTime="2018-05-06T20:49:56Z">
<TotalTimeSeconds>2495.363</TotalTimeSeconds>
<DistanceMeters>15345</DistanceMeters>
<MaximumSpeed>18.4</MaximumSpeed>
<Calories>0</Calories>
<Intensity>Active</Intensity>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2018-05-06T20:49:56Z</Time>
<Position>
<LatitudeDegrees>49.319297</LatitudeDegrees>
<LongitudeDegrees>-123.024128</LongitudeDegrees>
</Position>
<HeartRateBpm>
<Value>99</Value>
</HeartRateBpm>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>0</Watts>
<Speed>2</Speed>
</TPX>
</Extensions>
</Trackpoint>
If I change my parser to change all tags with:
for watt in root.iter():
    if watt.tag != "Watts":
        watt.text = "strong"
Then my output.xml file becomes:
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">strong<Activities>strong<Activity Sport="Biking">strong<Id>strong</Id>
<Lap StartTime="2018-05-06T20:49:56Z">strong<TotalTimeSeconds>strong</TotalTimeSeconds>
<DistanceMeters>strong</DistanceMeters>
<MaximumSpeed>strong</MaximumSpeed>
<Calories>strong</Calories>
<Intensity>strong</Intensity>
<TriggerMethod>strong</TriggerMethod>
<Track>strong<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<AltitudeMeters>strong</AltitudeMeters>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
How can I change just the Watts tag?
I don't understand what the root = tree.getroot() does. I just thought I'd ask this question at the same time, although I'm not sure it matters in my particular problem.
Your document defines a default XML namespace. Look at the xmlns= attribute at the end of the opening tag:
<TrainingCenterDatabase
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
This means there is no element named "Watts" in your document; you will need to qualify tag names with the appropriate namespace. If you print out the value of watt.tag in your loop, you will see:
$ python filter.py
{http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}TrainingCenterDatabase
[...]
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Speed
With this in mind, you can modify your filter so that it looks like this:
from lxml import etree

tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts":
        watt.text = "strong"
tree.write("output.xml")
You can read more about namespace handling in the lxml documentation.
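To avoid spelling out the full URI in every comparison, you can map a prefix to the namespace and search with it. The sketch below uses the standard library's xml.etree.ElementTree on a shortened sample so it runs on its own; lxml's findall() accepts a namespaces mapping in the same way. The prefix "ext" and the trimmed document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>0</Watts>
<Speed>2</Speed>
</TPX>
</Extensions>
</TrainingCenterDatabase>"""

# The prefix "ext" is an arbitrary local alias; only the URI has to match.
NS = {"ext": "http://www.garmin.com/xmlschemas/ActivityExtension/v2"}

root = ET.fromstring(xml_text)
for watt in root.findall(".//ext:Watts", NS):
    watt.text = "strong"

watts = [w.text for w in root.iter("{%s}Watts" % NS["ext"])]
speeds = [s.text for s in root.iter("{%s}Speed" % NS["ext"])]
print(watts, speeds)
```

Only the Watts elements change; Speed keeps its original text.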
Alternatively, since you want to edit XML and are already using lxml, consider XSLT (the XML transformation language), where you can define a namespace prefix and change Watts anywhere in the document without looping. Plus, you can pass values into XSLT from Python!
XSLT (save as .xsl file)
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2" version="1.0">
  <xsl:output version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <!-- VALUE TO BE PASSED INTO FROM PYTHON -->
  <xsl:param name="python_value"/>
  <!-- Identity Transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- ADJUST WATTS TEXT -->
  <xsl:template match="doc:Watts">
    <xsl:copy><xsl:value-of select="$python_value"/></xsl:copy>
  </xsl:template>
</xsl:transform>
Python
from lxml import etree
# LOAD XML AND XSL
doc = etree.parse("cycling.xml")
xsl = etree.parse('XSLT_Script.xsl')
# CONFIGURE TRANSFORMER
transform = etree.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = etree.XSLT.strparam('Strong')
result = transform(doc, python_value=n)
# PRINT TO CONSOLE
print(result)
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Pull out sections from XML in Python

Please note that I have some Python experience but not a lot of deep experience so please bear with me.
I have a very large XML file, ~100 megs, that has many, many sections and subsections. I need to pull out each subsection of a certain type (and there are a lot with this type) and write each to a different file. The writing I can handle, but I'm staring at ElementTree documentation trying to make sense of how to traverse the tree, find an element declared this way, yank out just the data between those tags and process it, then continue down the file.
The structure is similar to this (slightly obfuscated). What I want to do is pull out each section labeled "data" individually.
<filename>
<config>
<collections>
<datas>
<data>
...
</data>
<data>
...
</data>
<data>
...
</data>
</datas>
</collections>
</config>
</filename>
I think you can read in each data element using iterparse and then write it out. The following simply prints the element using the print function, but you could of course write it to a file instead:
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse("input.xml"):
    if elem.tag == 'data':
        print(ET.tostring(elem, 'UTF-8', 'xml'))
        # Clear only after the element has been processed, to free memory
        elem.clear()
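To write each element to its own file instead of printing it, something along these lines should work. This is a sketch with an in-memory sample and a temporary directory so that it runs on its own; the data_N.xml naming scheme and the sample content are assumptions.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

xml_bytes = b"""<filename><config><collections><datas>
<data><value>1</value></data>
<data><value>2</value></data>
</datas></collections></config></filename>"""

out_dir = tempfile.mkdtemp()  # hypothetical output location
written = []
for event, elem in ET.iterparse(io.BytesIO(xml_bytes)):
    # iterparse reports "end" events by default, so elem is complete here.
    if elem.tag == "data":
        path = os.path.join(out_dir, "data_%d.xml" % (len(written) + 1))
        ET.ElementTree(elem).write(path, encoding="utf-8", xml_declaration=True)
        written.append(path)
        elem.clear()  # free the subtree that has just been written

print(len(written))
```

For a real file, pass its path to ET.iterparse in place of the BytesIO object. Clearing each element after writing keeps memory use low on a file of around 100 MB.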
Consider an XSLT solution with Python's third-party module, lxml. Specifically, use xpath() to count the <data> nodes, then iteratively build a dynamic XSLT script that selects only the needed element by node index [#] and writes each one to its own XML file:
import lxml.etree as et

dom = et.parse('Input.xml')
datalen = len(dom.xpath("//data"))

for i in range(1, datalen+1):
    # Build a stylesheet that copies only the i-th <data> element
    xsltstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
    <xsl:strip-space elements="*"/>
    <xsl:template match="datas">
      <xsl:apply-templates select="data[{0}]" />
    </xsl:template>
    <xsl:template match="data[{0}]">
      <xsl:copy>
        <xsl:copy-of select="*"/>
      </xsl:copy>
    </xsl:template>
    </xsl:transform>'''.format(i)
    xslt = et.fromstring(xsltstr)
    transform = et.XSLT(xslt)
    newdom = transform(dom)
    tree_out = et.tostring(newdom, encoding='UTF-8', pretty_print=True,
                           xml_declaration=True)
    # Name each output file after its index: Data1.xml, Data2.xml, ...
    with open('Data{}.xml'.format(i), 'wb') as xmlfile:
        xmlfile.write(tree_out)

lxml (or lxml.html): print tree structure

I'd like to print out the tree structure of an etree (formed from an HTML document) in a distinguishing way, meaning that two different etrees should print out differently.
What I mean by structure is the "shape" of the tree, which basically means all the tags but no attribute and no text content.
Any idea? Is there something in lxml to do that?
If not, I guess I have to iterate through the whole tree and construct a string from that. Any idea how to represent the tree in a compact way? (the "compact" feature is less relevant)
FYI, it is not intended to be looked at, but to be stored and hashed so that I can tell several HTML templates apart.
Thanks
Maybe just run some XSLT over the source XML to strip everything but the tags. It's then easy enough to use etree.tostring to get a string you could hash...
from lxml import etree as ET

def pp(e):
    # tostring() returns bytes, so decode before printing
    print(ET.tostring(e, pretty_print=True).decode())
    print()
root = ET.XML("""\
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8" />
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
<boolean id="import_live">0</boolean>
</preference-set>
</project>
""")
pp(root)
xslt = ET.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
""")
tr = ET.XSLT(xslt)
doc2 = tr(root)
root2 = doc2.getroot()
pp(root2)
Gives you the output:
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8"/>
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
<boolean id="import_live">0</boolean>
</preference-set>
</project>
<project>
<livefolder/>
<livefolder/>
<preference-set>
<boolean/>
</preference-set>
</project>
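If you would rather skip XSLT, walking the tree and building a compact string is also short. The sketch below uses the standard library's ElementTree and hashlib; the shape and shape_hash helper names, the parenthesised notation and the choice of SHA-256 are assumptions for illustration. With lxml the same recursion should work on parsed HTML, though comments and processing instructions would need to be skipped.

```python
import hashlib
import xml.etree.ElementTree as ET

def shape(elem):
    """Return a compact string of the tag structure, ignoring attributes and text."""
    return "%s(%s)" % (elem.tag, ",".join(shape(child) for child in elem))

def shape_hash(xml_text):
    """Hash the structural string so templates can be compared cheaply."""
    root = ET.fromstring(xml_text)
    return hashlib.sha256(shape(root).encode("utf-8")).hexdigest()

a = '<project id="1"><livefolder>Mooo</livefolder><prefs><boolean>0</boolean></prefs></project>'
b = '<project id="2"><livefolder/><prefs><boolean>1</boolean></prefs></project>'
c = '<project><prefs><boolean/></prefs><livefolder/></project>'

print(shape(ET.fromstring(a)))
print(shape_hash(a) == shape_hash(b), shape_hash(a) == shape_hash(c))
```

Documents a and b differ only in attributes and text, so they hash identically, while c has its children in a different order and hashes differently.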
