How can I use Python to remove an XML node?

I want to remove elements from XML files. With ElementTree I can read all the elements, but the XML declaration and the comments are lost.
So if I use:
from xml.etree import ElementTree
# get xml nodes
tree = ElementTree.parse(file_path)
# do filter things ...
# write to files
tree.write(file_path)
the declaration and comments are missing from the output. Is there a way to remove elements from *.xml files while keeping the comments, the declaration, and anything else in the files?
For example, the source:
<?xml version="1.0" encoding="utf-8"?>
<!-- I am annotation -->
<string name="name">content</string><string left="left">left things</string>
And my target:
<?xml version="1.0" encoding="utf-8"?>
<!-- I am annotation -->
<string left="left">left things</string>
But when I use tree.write(file_path), the comment and the declaration are lost, and the file becomes:
<string left="left">left things</string>

This is possible with lxml, whose XMLParser accepts a remove_comments=False option to preserve XML comments:
from lxml import etree
parser = etree.XMLParser(remove_comments=False)
tree = etree.parse("input.xml", parser=parser)
root = tree.getroot()
# select every <string> element whose name attribute equals "name"
for c in root.findall(".//string[@name='name']"):
    # remove from the actual parent, which is not necessarily root
    c.getparent().remove(c)
tree.write("output.xml")
"input.xml" :
<root>
<!-- I am annotation -->
<string name="name">content</string><string left="left">left things</string>
</root>
"output.xml" :
<root>
<!-- I am annotation -->
<string left="left">left things</string>
</root>
Related question :
How to prevent xml.ElementTree fromstring from dropping commentnode

Use xml.etree.ElementTree (https://docs.python.org/2/library/xml.etree.elementtree.html):
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
# findall() on an element needs a relative path; this matches direct children of root
for node in root.findall("string[@name='name']"):
    root.remove(node)
tree.write('output_data.xml')
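Note that the standard-library parser drops comments by default. A minimal sketch of one way around that, assuming Python 3.8 or newer (where TreeBuilder accepts insert_comments=True); the SAMPLE string is a stand-in for the input file, and comments placed before the root element are still not kept by this approach:

```python
import xml.etree.ElementTree as ET

# Stand-in for input.xml (assumption: the comment sits inside the root element).
SAMPLE = """<root>
<!-- I am annotation -->
<string name="name">content</string><string left="left">left things</string>
</root>"""

# insert_comments=True keeps comment nodes in the tree (Python 3.8+).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)

# ElementTree has no getparent(), so visit every element as a potential parent
# and remove matching direct children from it.
for parent in list(root.iter()):
    for child in parent.findall("string[@name='name']"):
        parent.remove(child)

output = ET.tostring(root, encoding="unicode")
print(output)
```

The list() call takes a snapshot of the elements before anything is removed, so the tree is not modified while it is being iterated.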

Related

Getting root node's attributes (namespace) in Python

I need to extract namespace which comes at the very beginning of xml file.
It looks something like this.
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="CannotGetThisAttrib" xmlns:b="CannotGetThisAttrib">
<fileHeader c="CanGetThisAttrib"/>
<body></body>
<fooder/>
</root>
I can extract the attributes of elements beneath the root node. However, I cannot get the root node's attributes a and b, which are the namespaces I need in order to parse the XML file.
import xml.etree.ElementTree as ET
tree = ET.parse("xmlfile.xml")
root = tree.getroot()
root.attrib            # => {} (the xmlns declarations are not listed)
root[0].attrib["c"]    # => CanGetThisAttrib
Any advice is appreciated.
Here is a way to do it with lxml:
from lxml import etree
data = '''<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="CannotGetThisAttrib" xmlns:b="CannotGetThisAttrib">
<fileHeader c="CanGetThisAttrib"/>
<body></body>
<fooder/>
</root>
'''
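# lxml rejects str input that carries an encoding declaration, so pass bytes
data = data.encode('ascii')
tree = etree.fromstring(data)
# nsmap maps each prefix in scope on the element to its namespace URI
for k, v in tree.nsmap.items():
    print('{} -> {}'.format(k, v))
Output: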
a -> CannotGetThisAttrib
b -> CannotGetThisAttrib
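If lxml is not available, a similar result can be sketched with the standard library, which reports namespace declarations as "start-ns" events during incremental parsing. The helper name collect_namespaces is made up for this illustration, and the document is read from an in-memory buffer instead of a file:

```python
import io
import xml.etree.ElementTree as ET

DATA = b"""<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="CannotGetThisAttrib" xmlns:b="CannotGetThisAttrib">
<fileHeader c="CanGetThisAttrib"/>
<body></body>
<fooder/>
</root>
"""

def collect_namespaces(xml_bytes):
    """Return a {prefix: uri} dict of every namespace declared in the document."""
    namespaces = {}
    # Each "start-ns" event carries a (prefix, uri) tuple.
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",)):
        namespaces[prefix] = uri
    return namespaces

namespaces = collect_namespaces(DATA)
for prefix, uri in namespaces.items():
    print("{} -> {}".format(prefix, uri))
```

Unlike nsmap on the root element, this collects declarations from the whole document, and a prefix that is declared twice keeps only its last URI.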

Parsing an xml file using lxml

I'm trying to edit an xml file by finding each Watts tag and changing the text in it. So far I've managed to change all tags, but not the Watts tag specifically.
My parser is:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    if watt.tag == "Watts":
        watt.text = "strong"
tree.write("output.xml")
This leaves the content unchanged. A snippet from output.xml (which is identical to cycling.xml, since nothing was modified) is:
<TrainingCenterDatabase xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
<Activities>
<Activity Sport="Biking">
<Id>2018-05-06T20:49:56Z</Id>
<Lap StartTime="2018-05-06T20:49:56Z">
<TotalTimeSeconds>2495.363</TotalTimeSeconds>
<DistanceMeters>15345</DistanceMeters>
<MaximumSpeed>18.4</MaximumSpeed>
<Calories>0</Calories>
<Intensity>Active</Intensity>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2018-05-06T20:49:56Z</Time>
<Position>
<LatitudeDegrees>49.319297</LatitudeDegrees>
<LongitudeDegrees>-123.024128</LongitudeDegrees>
</Position>
<HeartRateBpm>
<Value>99</Value>
</HeartRateBpm>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>0</Watts>
<Speed>2</Speed>
</TPX>
</Extensions>
</Trackpoint>
If I change my parser to change all tags with:
for watt in root.iter():
    if watt.tag != "Watts":
        watt.text = "strong"
Then my output.xml file becomes:
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">strong<Activities>strong<Activity Sport="Biking">strong<Id>strong</Id>
<Lap StartTime="2018-05-06T20:49:56Z">strong<TotalTimeSeconds>strong</TotalTimeSeconds>
<DistanceMeters>strong</DistanceMeters>
<MaximumSpeed>strong</MaximumSpeed>
<Calories>strong</Calories>
<Intensity>strong</Intensity>
<TriggerMethod>strong</TriggerMethod>
<Track>strong<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
<Trackpoint>strong<Time>strong</Time>
<Position>strong<LatitudeDegrees>strong</LatitudeDegrees>
<LongitudeDegrees>strong</LongitudeDegrees>
</Position>
<AltitudeMeters>strong</AltitudeMeters>
<HeartRateBpm>strong<Value>strong</Value>
</HeartRateBpm>
<Extensions>strong<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">strong<Watts>strong</Watts>
<Speed>strong</Speed>
</TPX>
</Extensions>
</Trackpoint>
How can I change just the Watts tag?
I don't understand what the root = tree.getroot() does. I just thought I'd ask this question at the same time, although I'm not sure it matters in my particular problem.
Your document defines a default XML namespace. Look at the xmlns= attribute at the end of the opening tag:
<TrainingCenterDatabase
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
This means there is no element named plain "Watts" in your document; you need to qualify tag names with the appropriate namespace. If you print out the value of watt.tag in your loop, you will see:
$ python filter.py
{http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}TrainingCenterDatabase
[...]
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts
{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Speed
With this in mind, you can modify your filter so that it looks like this:
from lxml import etree
tree = etree.parse("cycling.xml")
root = tree.getroot()
for watt in root.iter():
    # tag names are qualified as {namespace-uri}localname
    if watt.tag == "{http://www.garmin.com/xmlschemas/ActivityExtension/v2}Watts":
        watt.text = "strong"
tree.write("output.xml")
You can read more about namespace handling in the lxml documentation.
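Instead of spelling out the full qualified name in a comparison, the lookup can also go through a prefix map. The following is a small sketch using the standard library's ElementTree (lxml's findall() takes the same namespaces argument); the prefix "tpx" and the trimmed SAMPLE document are assumptions made for the example:

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for cycling.xml.
SAMPLE = """<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>0</Watts>
<Speed>2</Speed>
</TPX>
</Extensions>
</TrainingCenterDatabase>"""

# The prefix is arbitrary; only the URI has to match the document.
NS = {"tpx": "http://www.garmin.com/xmlschemas/ActivityExtension/v2"}

root = ET.fromstring(SAMPLE)

# Change only the Watts elements in the ActivityExtension namespace.
for watt in root.findall(".//tpx:Watts", NS):
    watt.text = "strong"

watts_texts = [w.text for w in root.iter("{%s}Watts" % NS["tpx"])]
speed_texts = [s.text for s in root.findall(".//tpx:Speed", NS)]
print(watts_texts, speed_texts)
```

As for getroot(): parse() returns an object that represents the whole document, and getroot() returns its top-level element (TrainingCenterDatabase here), which is the starting point for iteration and searches.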
Alternatively, since you want to edit XML and are already using lxml, consider XSLT (the XML transformation language), where you can define a namespace prefix and change Watts anywhere in the document without looping. You can also pass values into XSLT from Python.
XSLT (save as an .xsl file)
<?xml version="1.0"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://www.garmin.com/xmlschemas/ActivityExtension/v2" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- VALUE TO BE PASSED INTO FROM PYTHON -->
<xsl:param name="python_value"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- ADJUST WATTS TEXT -->
<xsl:template match="doc:Watts">
<xsl:copy><xsl:value-of select="$python_value"/></xsl:copy>
</xsl:template>
</xsl:transform>
Python
from lxml import etree
# LOAD XML AND XSL
doc = etree.parse("cycling.xml")
xsl = etree.parse('XSLT_Script.xsl')
# CONFIGURE TRANSFORMER
transform = etree.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = etree.XSLT.strparam('Strong')
result = transform(doc, python_value=n)
# PRINT TO CONSOLE
print(result)
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

How to write XML file with xml tags that include html tags in its text using Python?

I recently came to the realization that XML containing HTML tags in the body text of some elements seems to make parsers like WP All Import choke.
So to mitigate this, I attempted to write a Python script to properly put out XML.
It starts with this XML file (this is just an excerpt):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
</Row>
...
</Root>
The desired output is:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
<Introduction_Body><![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like <a href="http://blah.com/blah.html"></a>, <br>, <img src="http://blahimg.jpg">, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]> </Introduction_Body>
</Row>
...
</Root>
Unfortunately, I'm getting the following with weird escape characters like:
<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
<Introduction_Body>&lt;![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like &lt;a href="http://blah.com/blah.html"&gt;&lt;/a&gt;, &lt;br&gt;, &lt;img src="http://blahimg.jpg"&gt;, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]&gt; </Introduction_Body>
</Row>
...
</Root>
So I'd like to fix the following:
1) Output new XML file that preserves the text including the HTML in the newly introduced "Introduction_Body" tag as well as any other tags like "Waterfall_Name"
2) Is it possible to cleanly pretty print this (for human-readability)? How?
My Python code currently looks like this:
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os
data_file = 'test3_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()
for element in root:
    if element.find('File_directory') is not None:
        directory = element.find('File_directory').text
    if element.find('Introduction') is not None:
        introduction = element.find('Introduction').text
        intro_tree = directory+introduction
        with open(intro_tree, 'r') as f: #note this with statement eliminates need for f.close()
            intro_text = f.read()
        intro_body = ET.SubElement(element,'Introduction_Body')
        intro_body.text = '<![CDATA[' + intro_text + ']]>'
#tree.write('new_' + data_file) #same result but leaves out the xml header
f = open('new_' + data_file, 'w')
f.write('<?xml version="1.0" encoding="UTF-8" standalone="yes">' + ET.tostring(root))
f.close()
Thanks,
Johnny
I would recommend you switch to lxml. It is well-documented and (almost) completely compatible with python's own xml. You might only have to minimally change your code. lxml supports CDATA very handily:
> from lxml import etree
> elmnt = etree.Element('root')
> elmnt.text = etree.CDATA('abcd')
> etree.dump(elmnt)
<root><![CDATA[abcd]]></root>
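That aside, you should definitely use your XML library not only for parsing XML but also for writing it. lxml can write the declaration for you (passing xml_declaration=True makes this explicit, so it does not depend on how the encoding argument is spelled):
> print(etree.tostring(elmnt, encoding="utf-8", xml_declaration=True))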
<?xml version='1.0' encoding='utf-8'?>
<root><![CDATA[abcd]]></root>
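For comparison, the standard library can also emit CDATA sections through xml.dom.minidom, which is a different API from the lxml one recommended above. This is a rough sketch only; the helper add_cdata_child, the tiny input document, and the html_text value are invented for the example:

```python
from xml.dom import minidom

def add_cdata_child(doc, parent, tag, text):
    """Append <tag><![CDATA[text]]></tag> to parent and return the new element."""
    child = doc.createElement(tag)
    # A CDATA section is written out verbatim, so markup inside it is not escaped.
    child.appendChild(doc.createCDATASection(text))
    parent.appendChild(child)
    return child

doc = minidom.parseString("<Root><Row><Entry_No>657</Entry_No></Row></Root>")
row = doc.getElementsByTagName("Row")[0]

# Stand-in for the HTML read from the introduction file.
html_text = 'Intro with <br> and <img src="http://blahimg.jpg"> and 德天瀑布'
add_cdata_child(doc, row, "Introduction_Body", html_text)

# toxml() with an encoding returns bytes that start with an XML declaration.
serialized = doc.toxml(encoding="UTF-8").decode("UTF-8")
print(serialized)
```

One caveat: the text must not contain the sequence ]]> itself, since that would terminate the CDATA section early.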

How to keep the xml-stylesheet?

I want to keep the xml-stylesheet processing instruction, but it gets lost.
I use Python to modify the XML in order to deploy Hadoop automatically.
XML:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://c11:9000</value>
  </property>
</configuration>
Code:
from xml.etree.ElementTree import ElementTree as ET
def modify_core_site(namenode_hostname):
    tree = ET()
    tree.parse("pkg/core-site.xml")
    root = tree.getroot()
    for p in root.iter("property"):
        name = p.find("name").text
        if name == "fs.default.name":
            text = "hdfs://%s:9000" % namenode_hostname
            p.find("value").text = text
    tree.write("pkg/tmp.xml", encoding="utf-8", xml_declaration=True)
modify_core_site("c80")
Result:
<?xml version='1.0' encoding='utf-8'?>
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://c80:9000</value>
  </property>
</configuration>
The xml-stylesheet line disappears...
How can I keep it?
One solution is to use lxml. Once you have parsed the XML, step back from the root element until you reach the xsl node. Quick sample below:
>>> import lxml.etree
>>> doc = lxml.etree.parse('C:/downloads/xmltest.xml')
>>> root = doc.getroot()
>>> xslnode=root.getprevious().getprevious()
>>> xslnode
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
Make sure you add some exception handling and check that the node actually exists. You can check whether the node is an XSLT processing instruction with:
>>> isinstance(xslnode, lxml.etree._XSLTProcessingInstruction)
True
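If staying within the standard library is preferred, xml.dom.minidom keeps top-level processing instructions and comments when it re-serializes a document. The sketch below uses minidom instead of the lxml calls shown above; the function name set_namenode is hypothetical and the XML is held in a string, not read from pkg/core-site.xml:

```python
from xml.dom import minidom

SOURCE = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://c11:9000</value>
  </property>
</configuration>
"""

def set_namenode(xml_text, namenode_hostname):
    """Rewrite the fs.default.name value and return the whole document as text."""
    doc = minidom.parseString(xml_text)
    for prop in doc.getElementsByTagName("property"):
        name = prop.getElementsByTagName("name")[0].firstChild.data
        if name == "fs.default.name":
            value = prop.getElementsByTagName("value")[0]
            value.firstChild.data = "hdfs://%s:9000" % namenode_hostname
    # The stylesheet instruction and the comment are children of the document
    # node, so they are written out together with the root element.
    return doc.toxml()

updated = set_namenode(SOURCE, "c80")
print(updated)
```

minidom rewrites the XML declaration in its own form, so the first line of the output may not match the input byte for byte.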

Accessing Elements with and without namespaces using lxml

Is there a way, using lxml, to search in a single pass for the same element whether it occurs with or without a namespace? As an example, I would want to get all occurrences of the element identifier irrespective of whether or not it is associated with a specific namespace. I am currently only able to access them separately as below.
Code:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
root = xmlfile.getroot()
for l in root.iter('identifier'):
    print(l.text)
for l in root.iter('{http://www.openarchives.org/OAI/2.0/provenance}identifier'):
    print(l.text)
File: xmlfile.xml
<?xml version="1.0"?>
<record>
<header>
<identifier>identifier1</identifier>
<datestamp>datastamp1</datestamp>
<setSpec>setspec1</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>title1</dc:title>
<dc:title>title2</dc:title>
<dc:creator>creator1</dc:creator>
<dc:subject>subject1</dc:subject>
<dc:subject>subject2</dc:subject>
</oai_dc:dc>
</metadata>
<about>
<provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription altered="false" harvestDate="2011-08-11T03:47:51Z">
<baseURL>baseURL1</baseURL>
<identifier>identifier3</identifier>
<datestamp>datestamp2</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
<originDescription altered="false" harvestDate="2010-10-10T06:15:53Z">
<baseURL>xxxxx</baseURL>
<identifier>identifier4</identifier>
<datestamp>2010-04-27T01:10:31Z</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
</originDescription>
</originDescription>
</provenance>
</about>
</record>
You could use XPath to solve that kind of issue:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
identifier_nodes = xmlfile.xpath("//*[local-name() = 'identifier']")
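The same idea can be approximated without XPath by comparing only the part of each tag after the closing brace. A minimal standard-library sketch, where local_name is a helper written for this example and SAMPLE is a shortened version of xmlfile.xml:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for xmlfile.xml.
SAMPLE = """<record>
<header><identifier>identifier1</identifier></header>
<about>
<provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
<originDescription><identifier>identifier3</identifier></originDescription>
</provenance>
</about>
</record>"""

def local_name(tag):
    """Strip a leading {namespace-uri} from a tag name, if there is one."""
    return tag.rsplit("}", 1)[-1]

root = ET.fromstring(SAMPLE)

# Matches identifier elements with and without a namespace, in document order.
identifiers = [el.text for el in root.iter() if local_name(el.tag) == "identifier"]
print(identifiers)
```

With lxml, where comments and processing instructions can also show up during iteration, it may be necessary to skip nodes whose tag is not a string before calling the helper.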
