Find text across multiple lines to replace it with Python (XML) - python

I have the following XML file:
<?xml version='1.0' encoding='utf8'?>
<Products>
<item type="dict">
<id type="int">37475</id>
<name type="str">something_something</name>
<slug type="str">something_something</slug>
<permalink type="str">something_something</permalink>
<date_created type="str">date</date_created>
<date_created_gmt type="str">date</date_created_gmt>
<date_modified type="str">date</date_modified>
<date_modified_gmt type="str">date</date_modified_gmt>
<type type="str">simple</type>
<status type="str">publish</status>
<featured type="bool">False</featured>
<catalog_visibility type="str">visible</catalog_visibility>
<description type="str">something_something</description>
</item>
I started with a JSON file that I converted to an XML file, so all of the products in that file start with the <item type="dict"> tag, which is not what I want. I would like all of the products to be enclosed in a <product> tag.
To fix this issue I am doing the following:
import xml.etree.ElementTree as ET
# root is the Element built from the converted JSON
tree = ET.ElementTree(root)
xmlstr = ET.tostring(root, encoding='utf8', method='xml')  # serialize to a string so that it can be edited
finalstr = xmlstr.decode("utf-8").replace(' />', '')  # remove wrong part
finalstr = finalstr.replace('<item type="dict"> <id type="int">', '<product> <id type="int">')
This works for other problems in my XML file, but only when they are on one line.
My question is how do I select two or more lines so that I can replace them?
Desired output:
<?xml version='1.0' encoding='utf8'?>
<Products>
<product>
<id type="int">37475</id>
<name type="str">something_something</name>
<slug type="str">something_something</slug>
<permalink type="str">something_something</permalink>
<date_created type="str">date</date_created>
<date_created_gmt type="str">date</date_created_gmt>
<date_modified type="str">date</date_modified>
<date_modified_gmt type="str">date</date_modified_gmt>
<type type="str">simple</type>
<status type="str">publish</status>
<featured type="bool">False</featured>
<catalog_visibility type="str">visible</catalog_visibility>
<description type="str">something_something</description>
</product>

This should be possible with a regular expression:
import re
finalstr = re.sub(r'<item type="dict">\s*<id type="int">', '<product>\n<id type="int">', finalstr)
This lets the pattern span more than one line: the \s* part between the XML nodes matches any amount of whitespace, newlines included.
Read more about re.sub here: https://docs.python.org/3/library/re.html
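A minimal sketch of the multi-line replacement, assuming the serialized XML is held in a string. The `sample` value is made up for illustration. It also renames the closing tag, which a substitution on the opening tag alone leaves as </item>.

```python
import re

# Hypothetical sample: the opening tag and its first child sit on separate lines.
sample = """<Products>
<item type="dict">
  <id type="int">37475</id>
  <name type="str">something_something</name>
</item>
</Products>"""

# \s already matches newlines, so \s* spans any line breaks and indentation
# between the two tags; \g<1> puts that whitespace back unchanged.
result = re.sub(r'<item type="dict">(\s*)<id type="int">',
                r'<product>\g<1><id type="int">', sample)

# The closing tag has to be renamed too, or the document is no longer well formed.
result = result.replace('</item>', '</product>')

print(result)
```

Capturing the whitespace instead of hard-coding "\n" keeps the original indentation intact.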

Here is the XSLT for this scenario. It follows the so-called modified identity transform pattern.
If you need the XML prolog, change omit-xml-declaration="yes" to omit-xml-declaration="no".
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<!-- identity template -->
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="item">
<product>
<xsl:apply-templates />
</product>
</xsl:template>
</xsl:stylesheet>
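Since the question is about Python, here is a hedged sketch of applying this stylesheet with lxml (assuming lxml is installed). The stylesheet is embedded as a byte string with the attribute axis written as @*, and the shortened input document is invented for the demo.

```python
from lxml import etree

# The stylesheet from the answer, embedded so the example is self-contained.
XSLT_SOURCE = b"""<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="item">
    <product>
      <xsl:apply-templates/>
    </product>
  </xsl:template>
</xsl:stylesheet>"""

# Hypothetical, shortened input in the shape shown in the question.
XML_SOURCE = b"""<Products>
<item type="dict">
<id type="int">37475</id>
<status type="str">publish</status>
</item>
</Products>"""

transform = etree.XSLT(etree.fromstring(XSLT_SOURCE))
result = transform(etree.fromstring(XML_SOURCE))

root = result.getroot()
print(str(result))  # serialized according to the xsl:output settings
```

Because the item template only applies templates to child nodes, the type="dict" attribute is dropped from <product>, while the attributes on the children pass through the identity template.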

Related

How to evaluate XSLT processor parameters in a local context?

Let's say I have this source XML:
<A>
<B>something</B>
<B>something else</B>
</A>
and I want to transform it into this target XML:
<C>
<D>something</D>
<D>something else</D>
</C>
The obvious XSL of course is this:
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="A">
<C>
<xsl:for-each select="B">
<D><xsl:value-of select="."/></D>
</xsl:for-each>
</C>
</xsl:template>
</xsl:stylesheet>
Now let's say I don't know the paths I'm going to use beforehand and I want to parametrize them from my processor, which happens to be lxml (in Python).
So I change my XSL into this:
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:param name="path_of_B"/>
<xsl:template match="A">
<C>
<xsl:for-each select="$path_of_B">
<D><xsl:value-of select="."/></D>
</xsl:for-each>
</C>
</xsl:template>
</xsl:stylesheet>
and I call it from Python like this:
source = etree.parse("source.xml")
transform = etree.XSLT(etree.parse("transform.xsl"))
target = transform(source, path_of_B="B")
This doesn't give me the intended result: paths passed from the processor are always evaluated in the global context, so the current() node is always the root, no matter where I use the parameter. Is there any way to evaluate the XPaths in the correct context, as happens in the first example where I write them by hand?
I have tried many approaches, such as:
Passing parameters in nested templates, because I thought the evaluation would have the context of the template
Passing the parameters as strings and evaluating them later, but XPath 1.0 doesn't have an eval() function like Python
Attribute value templates, but they are not allowed on xsl elements
At some point I even touched <xsl:namespace-alias> to dynamically generate my XSL but it was very confusing.
So in the end, I solved it by pre-processing my xsl file with a template engine or string-formatting. It works, but I was just wondering if there is a "pure" XSLT+processor solution.
"XPath 1.0 doesn't have an eval() function"
No, but the libxslt processor supports the EXSLT dyn:evaluate() extension function, so you could do:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:dyn="http://exslt.org/dynamic"
extension-element-prefixes="dyn">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>
<xsl:param name="path_of_B"/>
<xsl:template match="/A">
<C>
<xsl:for-each select="dyn:evaluate($path_of_B)">
<D>
<xsl:value-of select="."/>
</D>
</xsl:for-each>
</C>
</xsl:template>
</xsl:stylesheet>
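A sketch of calling this stylesheet from lxml, assuming an lxml build whose libxslt/libexslt includes the EXSLT dynamic module. One detail matters on the Python side: lxml evaluates plain keyword values as XPath expressions, so the path has to be passed as a string, for example with etree.XSLT.strparam(). The inline documents are shortened copies of the ones in the question.

```python
from lxml import etree

XSLT_SOURCE = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dyn="http://exslt.org/dynamic"
    extension-element-prefixes="dyn">
  <xsl:param name="path_of_B"/>
  <xsl:template match="/A">
    <C>
      <xsl:for-each select="dyn:evaluate($path_of_B)">
        <D><xsl:value-of select="."/></D>
      </xsl:for-each>
    </C>
  </xsl:template>
</xsl:stylesheet>"""

source = etree.fromstring(b"<A><B>something</B><B>something else</B></A>")
transform = etree.XSLT(etree.fromstring(XSLT_SOURCE))

# strparam() quotes the value so the stylesheet receives the string "B".
# A bare "B" would be evaluated by lxml as an XPath expression first.
target = transform(source, path_of_B=etree.XSLT.strparam("B"))

values = [d.text for d in target.getroot()]
print(values)
```

With the string form, dyn:evaluate() resolves the path against the node that is current inside the template, which is the local-context behaviour the question asks for.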
If you want to parametrize both your input and output element names, you could do something like this, although this method will not work well if your source XML's structure is not always the same.
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:param name="e1_input" select="'A'"/>
<xsl:param name="e1_output" select="'A_OUT'"/>
<xsl:param name="e2_input" select="'B'"/>
<xsl:param name="e2_output" select="'B_OUT'"/>
<xsl:template match="/">
<xsl:for-each select="*[name()=$e1_input]">
<xsl:element name="{$e1_output}">
<xsl:for-each select="*[name()=$e2_input]">
<xsl:element name="{$e2_output}">
<xsl:apply-templates/>
</xsl:element>
</xsl:for-each>
</xsl:element>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
See it working here: https://xsltfiddle.liberty-development.net/aKZkh9

Use XSLT to concat repeating child nodes into one line of CSV/TSV

If I have the following XML file:
<AnnotationSet Name="Bio">
<Annotation Id="6164" Type="Health_Care_Related_Organization" StartNode="0" EndNode="6">
<Feature>
<Name className="java.lang.String">VOCABS</Name>
<Value className="java.lang.String">NCI</Value>
</Feature>
<Feature>
<Name className="java.lang.String">Negation</Name>
<Value className="java.lang.String">Affirmed</Value>
</Feature>
<Feature>
<Name className="java.lang.String">inst_full</Name>
<Value className="java.lang.String">http://linkedlifedata.com/resource/umls/id/C0002424</Value>
</Feature>
<Feature>
<Name className="java.lang.String">Experiencer</Name>
<Value className="java.lang.String">Patient</Value>
</Feature>
<Feature>
<Name className="java.lang.String">PREF</Name>
<Value className="java.lang.String">Clinic</Value>
</Feature>
<Feature>
<Name className="java.lang.String">inst</Name>
<Value className="java.lang.String">C0002424</Value>
</Feature>
<Feature>
<Name className="java.lang.String">STY</Name>
<Value className="java.lang.String">Health Care Related Organization</Value>
</Feature>
<Feature>
<Name className="java.lang.String">TUI</Name>
<Value className="java.lang.String">T093</Value>
</Feature>
<Feature>
<Name className="java.lang.String">language</Name>
<Value className="java.lang.String"></Value>
</Feature>
<Feature>
<Name className="java.lang.String">Temporality</Name>
<Value className="java.lang.String">Recent</Value>
</Feature>
<Feature>
<Name className="java.lang.String">tui_full</Name>
<Value className="java.lang.String">http://linkedlifedata.com/resource/semanticnetwork/id/T093</Value>
</Feature>
</Annotation>
</AnnotationSet>
I would like to take the <Name> element of each <Feature> child node as a column header and the <Value> element as the value, and put them into a CSV or TSV file. I would also like StartNode and EndNode from the <Annotation> node as columns.
It would look something like:
StartNode EndNode VOCABS Negation ....
--------- ------- ------ -------- ----
0 6 NCI Affirmed ....
I am only familiar with writing XSLT where each node, i.e. <Feature>, contains every column for each row. Here, each row is contained within <Annotation>, and I am having difficulty pulling out what I need.
I tried writing the following xslt:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:output method="text" encoding="utf-8"/>
<xsl:template match="/">
<xsl:text>Name, Value
</xsl:text>
<xsl:for-each select="AnnotationSet/Annotation/Feature">
<xsl:value-of select="concat(Name,',',Value)"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
but cannot get this to run when testing on http://www.freeformatter.com/xsl-transformer.html
Anyone have any ideas?
I would ideally like to do this in python once the xslt is tested, where I have the following python script:
#!/usr/bin/env python
import lxml.etree as ET
import sys
import os
dom = ET.parse('gatetest.xml')
xslt = ET.parse('gate.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
print(str(newdom))  # text output has no root element to serialize
Your attempt to concatenate Name and Value makes no sense, because you need them one below the other, not one beside the other. Not to mention that you need each name only once (in the top row).
Try it this way instead:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="utf-8"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/AnnotationSet">
<xsl:text>StartNode&#9;EndNode</xsl:text>
<xsl:for-each select="Annotation[1]/Feature">
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Name"/>
</xsl:for-each>
<xsl:for-each select="Annotation">
<xsl:text>&#10;</xsl:text>
<xsl:value-of select="@StartNode" />
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="@EndNode" />
<xsl:for-each select="Feature">
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="Value"/>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The result in your example will look something like this (it's difficult to show tab-separated values correctly aligned):
StartNode EndNode VOCABS Negation inst_full Experiencer PREF inst STY TUI language Temporality tui_full
0 6 NCI Affirmed http://linkedlifedata.com/resource/umls/id/C0002424 Patient Clinic C0002424 Health Care Related Organization T093 Recent http://linkedlifedata.com/resource/semanticnetwork/id/T093
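For comparison, the same pivot can be sketched in plain Python with the standard library instead of XSLT. This is a swapped-in technique, not the stylesheet's method: it uses xml.etree.ElementTree and the csv module, and the inline document is a shortened, hypothetical version of the question's input.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the question's input.
XML_SOURCE = """<AnnotationSet Name="Bio">
  <Annotation Id="6164" Type="Health_Care_Related_Organization" StartNode="0" EndNode="6">
    <Feature><Name>VOCABS</Name><Value>NCI</Value></Feature>
    <Feature><Name>Negation</Name><Value>Affirmed</Value></Feature>
    <Feature><Name>language</Name><Value></Value></Feature>
  </Annotation>
</AnnotationSet>"""

root = ET.fromstring(XML_SOURCE)
annotations = root.findall("Annotation")

# Like the stylesheet, take the column names from the first Annotation.
feature_names = [f.findtext("Name") for f in annotations[0].findall("Feature")]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["StartNode", "EndNode"] + feature_names)

for annotation in annotations:
    # Look values up by name so a missing or reordered Feature cannot shift columns.
    values = {f.findtext("Name"): f.findtext("Value") or ""
              for f in annotation.findall("Feature")}
    writer.writerow([annotation.get("StartNode"), annotation.get("EndNode")]
                    + [values.get(name, "") for name in feature_names])

tsv_text = buffer.getvalue()
print(tsv_text)
```

Keying the values by feature name is a design choice of this sketch; the XSLT above relies on every Annotation listing its Features in the same order.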

merging two xml files and appending elements for similar elements and moving elements that aren't present in one file in python

I want to merge two XML files. I read many solutions but they are specific to those files. I am using xml.etree.ElementTree as well as lxml for parsing, comparing the files, getting the differences. I understand my next step is:
for element in file2.xml:
    if element present in file1.xml:
        append to output_file.xml
    else:
        copy element to the output_file
but I haven't worked much on XML, and the tools to merge are licensed, so I need to write a generic script to merge to the format I want.
file1.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<great_grands>
<great_grandpa_name_one>great_grandpa_name</great_grandpa_name_one>
<grandpa>
<grandpa_name>grandpa_name_one_1</grandpa_name>
</grandpa>
<grandpa>
<grandpa_name>grandpa_name_two_1</grandpa_name>
</grandpa>
<grandma>
<grandma_name>grandma_name_one_1</grandma_name>
</grandma>
<grandma>
<grandma_name>grandma_name_two_1</grandma_name>
</grandma>
</great_grands>
file2.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<great_grands>
<great_grandpa_name_two>great_grandpa_name</great_grandpa_name_two>
<grandpa>
<grandpa_name_2>grandpa_name_one_2</grandpa_name_2>
</grandpa>
<grandma>
<grandma_name_2>grandma_name_one_2</grandma_name_2>
</grandma>
</great_grands>
Required output:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<great_grands>
<great_grandpa_name_one>great_grandpa_name</great_grandpa_name_one>
<great_grandpa_name_two>great_grandpa_name</great_grandpa_name_two>
<grandpa>
<grandpa_name>grandpa_name_one_1</grandpa_name>
</grandpa>
<grandpa>
<grandpa_name>grandpa_name_two_1</grandpa_name>
</grandpa>
<grandpa>
<grandpa_name_2>grandpa_name_one_2</grandpa_name_2>
</grandpa>
<grandma>
<grandma_name>grandma_name_one_1</grandma_name>
</grandma>
<grandma>
<grandma_name>grandma_name_two_1</grandma_name>
</grandma>
<grandma>
<grandma_name_2>grandma_name_one_2</grandma_name_2>
</grandma>
</great_grands>
Consider XSLT, the special-purpose declarative language and sibling to XPath, designed to transform XML files. Using its document() function, it can parse from external XML files at relative links. Python's lxml module can process XSLT 1.0 scripts.
And because XSLT scripts are well-formed XML files you can parse from file or embedded string. Below assumes all files and scripts are saved in same directory:
XSLT Script (save as .xsl script, notice only file2.xml is referenced)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="/great_grands">
<xsl:copy>
<xsl:copy-of select="great_grandpa_name_one"/>
<xsl:copy-of select="document('file2.xml')/great_grands/great_grandpa_name_two"/>
<xsl:copy-of select="grandpa"/>
<xsl:copy-of select="document('file2.xml')/great_grands/grandpa"/>
<xsl:copy-of select="grandma"/>
<xsl:copy-of select="document('file2.xml')/great_grands/grandma"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
Python Script (notice only file1.xml is referenced)
from lxml import etree
xml = etree.parse('file1.xml')
xsl = etree.parse('XSLTScript.xsl')
transform = etree.XSLT(xsl)
newdom = transform(xml)
# Save the transformed document to a file
with open('Output.xml', 'wb') as f:
    f.write(bytes(newdom))  # bytes() serializes the result according to xsl:output
Output
<?xml version="1.0" encoding="UTF-8"?>
<great_grands>
<great_grandpa_name_one>great_grandpa_name</great_grandpa_name_one>
<great_grandpa_name_two>great_grandpa_name</great_grandpa_name_two>
<grandpa>
<grandpa_name>grandpa_name_one_1</grandpa_name>
</grandpa>
<grandpa>
<grandpa_name>grandpa_name_two_1</grandpa_name>
</grandpa>
<grandpa>
<grandpa_name_2>grandpa_name_one_2</grandpa_name_2>
</grandpa>
<grandma>
<grandma_name>grandma_name_one_1</grandma_name>
</grandma>
<grandma>
<grandma_name>grandma_name_two_1</grandma_name>
</grandma>
<grandma>
<grandma_name_2>grandma_name_one_2</grandma_name_2>
</grandma>
</great_grands>
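The stylesheet above names every element explicitly. For the generic script the question asks about, one possible sketch with only the standard library is shown below. The function name merge_children and its placement rule are assumptions made for illustration: an element from the second file goes after the last sibling with the same tag, and an element whose tag is new goes just before the group that followed it in the second file. Only top-level children are merged.

```python
import copy
import xml.etree.ElementTree as ET

FILE1 = """<great_grands>
  <great_grandpa_name_one>great_grandpa_name</great_grandpa_name_one>
  <grandpa><grandpa_name>grandpa_name_one_1</grandpa_name></grandpa>
  <grandpa><grandpa_name>grandpa_name_two_1</grandpa_name></grandpa>
  <grandma><grandma_name>grandma_name_one_1</grandma_name></grandma>
  <grandma><grandma_name>grandma_name_two_1</grandma_name></grandma>
</great_grands>"""

FILE2 = """<great_grands>
  <great_grandpa_name_two>great_grandpa_name</great_grandpa_name_two>
  <grandpa><grandpa_name_2>grandpa_name_one_2</grandpa_name_2></grandpa>
  <grandma><grandma_name_2>grandma_name_one_2</grandma_name_2></grandma>
</great_grands>"""


def merge_children(root1, root2):
    """Return a copy of root1 with root2's top-level children merged in by tag."""
    merged = copy.deepcopy(root1)
    pending = []  # root2 children whose tag has not been seen in the merged tree
    for child in root2:
        same_tag = [i for i, e in enumerate(merged) if e.tag == child.tag]
        if not same_tag:
            pending.append(child)
            continue
        # Place unmatched elements just before the group they preceded in root2.
        for offset, extra in enumerate(pending):
            merged.insert(same_tag[0] + offset, copy.deepcopy(extra))
        # Add the matched element after the last sibling with the same tag.
        merged.insert(same_tag[-1] + len(pending) + 1, copy.deepcopy(child))
        pending = []
    for extra in pending:  # nothing matched after these, so they go last
        merged.append(copy.deepcopy(extra))
    return merged


merged = merge_children(ET.fromstring(FILE1), ET.fromstring(FILE2))
merged_tags = [child.tag for child in merged]
print(merged_tags)
```

To write the result to disk, ET.ElementTree(merged).write('Output.xml', encoding='UTF-8', xml_declaration=True) should work; the whitespace between elements is carried over from the inputs rather than re-indented.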

XSLT transformation gives only root element python lxml

Currently doing XML-XSLT transformation using following code.
from lxml import etree
xmlRoot = etree.parse('path/abc.xml')
xslRoot = etree.parse('path/abc.xsl')
transform = etree.XSLT(xslRoot)
newdom = transform(xmlRoot)
print(etree.tostring(newdom, pretty_print=True))
The code above runs without errors, but the output contains only the root element rather than the whole XML content. When I run the same XML and XSL files through Altova, the transformation works fine. Is the syntax for printing the whole XML different, or is there an error here that you can spot?
XML content :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<slide name="slide7.xml" nav_lvl_1="Solutions" nav_lvl_2="Value Map" page_number="7">
<title>Retail Value Map</title>
<Subheadline>Retail </Subheadline>
</slide>
</root>
XSL content:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output encoding="UTF-8" indent="yes" method="xml" standalone="yes" version="1.0"/>
<xsl:template match="/">
<p:sld xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
<xsl:for-each select="root/slide">
<xsl:choose>
<xsl:when test="@nav_lvl_1='Solutions'">
<xsl:if test="@nav_lvl_2='Value Map'">
<p:txBody>
<a:p>
<a:r>
<a:rPr lang="en-US" dirty="0" smtClean="0"/>
<a:t>
<xsl:value-of select="title"/>
</a:t>
</a:r>
<a:endParaRPr lang="en-US" dirty="0"/>
</a:p>
</p:txBody>
</xsl:if>
</xsl:when>
</xsl:choose>
</xsl:for-each>
</p:sld>
</xsl:template>
</xsl:stylesheet>
Current output :
<p:sld xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"/>

Sorting XML files

Is it possible to sort XML files like the following:
<model name="ford">
<driver>Bob</driver>
<driver>Alice</driver>
</model>
<model name="audi">
<driver>Carly</driver>
<driver>Dean</driver>
</model>
Which would become
<model name="audi">
<driver>Carly</driver>
<driver>Dean</driver>
</model>
<model name="ford">
<driver>Alice</driver>
<driver>Bob</driver>
</model>
That is, the outermost elements are sorted first, then the second outermost, and so on.
They'd need to be sorted by element name first.
This is a refinement of Kirill's solution, I think it better reflects the stated requirements, and it avoids the type error XSLT 2.0 will give you if the sort key contains more than one value (but it still works on 1.0).
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" />
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates select="*">
<xsl:sort select="(@name | text())[1]"/>
</xsl:apply-templates>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Try this XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" />
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()">
<xsl:sort select="text() | @*"/>
</xsl:apply-templates>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
You can sort nodes by removing them from the parent node, and re-inserting them in the intended order. For example:
from lxml import etree

def sort_tree(tree):
    """Recursively sort the given etree in place."""
    for child in tree:
        sort_tree(child)
    # Slice assignment reorders the children without removing nodes while
    # iterating; a missing text node counts as an empty string.
    tree[:] = sorted(tree, key=lambda n: n.text or "")

tree = etree.fromstring(YOUR_XML)
sort_tree(tree)
print(etree.tostring(tree, pretty_print=True).decode())
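A self-contained variant of the same idea, sketched with the standard library's xml.etree.ElementTree. The sort key (element name, then the name attribute, then the text) is an assumption chosen to match the ordering in the question, and the <root> wrapper is added only so that the fragment parses.

```python
import xml.etree.ElementTree as ET

# The question's fragment, wrapped in a hypothetical <root> so it parses.
XML_SOURCE = """<root>
<model name="ford"><driver>Bob</driver><driver>Alice</driver></model>
<model name="audi"><driver>Carly</driver><driver>Dean</driver></model>
</root>"""


def sort_key(element):
    # Element name first, then the name attribute, then the text content.
    return (element.tag, element.get("name", ""), (element.text or "").strip())


def sort_tree(element):
    """Recursively sort the children of element in place."""
    for child in element:
        sort_tree(child)
    # Slice assignment replaces all children at once, in sorted order.
    element[:] = sorted(element, key=sort_key)


root = ET.fromstring(XML_SOURCE)
sort_tree(root)

order = [(m.get("name"), [d.text for d in m]) for m in root]
print(order)
```

Sorting on text alone would leave the two model elements in their original order, since their own text is only whitespace; including the attribute in the key is what moves audi ahead of ford.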
You don't need to sort the entire XML DOM.
Instead, take the required nodes into a list and sort them. Because the sorted order is needed while processing and not in the file, it's better done at run time.
Maybe like this, using minidom:
from xml.dom import minidom
document = """\
<root>
<model name="ford">
<driver>Bob</driver>
<driver>Alice</driver>
</model><model name="audi">
<driver>Carly</driver>
<driver>Dean</driver>
</model>
</root>
"""
document = minidom.parseString(document)
elements = document.getElementsByTagName("model")
# Sort on the attribute's string value; Attr objects themselves are not orderable
elements = sorted(elements, key=lambda element: element.getAttribute("name"))
