How to keep the xml-stylesheet? - python

I want to keep the xml-stylesheet processing instruction, but it gets lost when I rewrite the file.
I use Python to modify the XML so that I can deploy Hadoop automatically.
XML:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://c11:9000</value>
  </property>
</configuration>
Code:
from xml.etree.ElementTree import ElementTree as ET
def modify_core_site(namenode_hostname):
    tree = ET()
    tree.parse("pkg/core-site.xml")
    root = tree.getroot()
    for p in root.iter("property"):
        name = p.find("name").text
        if name == "fs.default.name":
            text = "hdfs://%s:9000" % namenode_hostname
            p.find("value").text = text
    tree.write("pkg/tmp.xml", encoding="utf-8", xml_declaration=True)
modify_core_site("c80")
Result:
<?xml version='1.0' encoding='utf-8'?>
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://c80:9000</value>
  </property>
</configuration>
The xml-stylesheet line disappears.
How can I keep it?

One solution is to use lxml. After parsing the XML, walk backwards from the root element until you reach the xml-stylesheet node. A quick sample:
>>> import lxml.etree
>>> doc = lxml.etree.parse('C:/downloads/xmltest.xml')
>>> root = doc.getroot()
>>> xslnode=root.getprevious().getprevious()
>>> xslnode
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
Make sure you add some exception handling and check that the node actually exists. You can check whether the node is an XSLT processing instruction with:
>>> isinstance(xslnode, lxml.etree._XSLTProcessingInstruction)
True
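If adding lxml as a dependency is not an option, a similar result may be possible with the standard library's xml.dom.minidom, which keeps processing instructions and comments that precede the root element as children of the document. The following is a minimal sketch of that swapped-in technique, not the lxml approach from the answer above; the set_namenode helper and the inline SOURCE string are hypothetical names used only for illustration.

```python
from xml.dom import minidom

# Inline copy of the question's file, so the sketch is self-contained
SOURCE = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://c11:9000</value>
  </property>
</configuration>
"""


def set_namenode(xml_text, namenode_hostname):
    """Hypothetical helper: rewrite fs.default.name and return the XML text."""
    dom = minidom.parseString(xml_text)
    for prop in dom.getElementsByTagName("property"):
        name = prop.getElementsByTagName("name")[0]
        if name.firstChild.data == "fs.default.name":
            value = prop.getElementsByTagName("value")[0]
            value.firstChild.data = "hdfs://%s:9000" % namenode_hostname
    # The PI and the comment are children of the document node,
    # so they are serialized along with the root element.
    return dom.toxml()


result = set_namenode(SOURCE, "c80")
print(result)
```

Note that minidom writes its own XML declaration and does not reproduce the line breaks between the declaration, the processing instruction, and the root element, so the output is equivalent but not byte-identical to the input.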

Related

How to insert a processing instruction in XML file?

I want to add an xml-stylesheet processing instruction before the root element of my XML file using ElementTree (Python 3.8).
Below is the code I use to create the XML file:
import xml.etree.cElementTree as ET
def Export_star_xml( self ):
    star_element = ET.Element("STAR",**{ 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance' })
    element_node = ET.SubElement(star_element ,"STAR_1")
    element_node.text = "Mario adam"
    # Wrap the root element in an ElementTree so that it can be written out
    tree = ET.ElementTree(star_element)
    tree.write( "star.xml" ,encoding="utf-8", xml_declaration=True )
Output:
<?xml version="1.0" encoding="windows-1252"?>
<STAR xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<STAR_1> Mario adam </STAR_1>
</STAR>
Output Expected:
<?xml version="1.0" encoding="windows-1252"?>
<?xml-stylesheet type="text/xsl" href="ResourceFiles/form_star.xsl"?>
<STAR xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<STAR_1> Mario adam </STAR_1>
</STAR>
I cannot figure out how to do this with ElementTree. Here is a solution that uses lxml, which provides an addprevious() method on elements.
from lxml import etree as ET
# Note the use of nsmap. The syntax used in the question is not accepted by lxml
star_element = ET.Element("STAR", nsmap={'xsi': 'http://www.w3.org/2001/XMLSchema-instance'})
element_node = ET.SubElement(star_element ,"STAR_1")
element_node.text = "Mario adam"
# Create the PI and insert it before the root element
pi = ET.ProcessingInstruction("xml-stylesheet", text='type="text/xsl" href="ResourceFiles/form_star.xsl"')
star_element.addprevious(pi)
ET.ElementTree(star_element).write("star.xml", encoding="utf-8",
                                   xml_declaration=True, pretty_print=True)
Result:
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="ResourceFiles/form_star.xsl"?>
<STAR xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<STAR_1>Mario adam</STAR_1>
</STAR>
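For readers who want to stay with the standard library, one workaround is to serialize the root element with ElementTree and write the prolog lines by hand. This is only a sketch under assumptions: the XML_DECLARATION and STYLESHEET_PI constants are hypothetical names, and the processing instruction is a plain string that ElementTree never sees, so it is up to the caller to keep it well-formed.

```python
import xml.etree.ElementTree as ET

# Build the same tree as in the question
star_element = ET.Element(
    "STAR", {"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance"}
)
element_node = ET.SubElement(star_element, "STAR_1")
element_node.text = "Mario adam"

# Hand-written prolog lines (assumption: they are already well-formed)
XML_DECLARATION = '<?xml version="1.0" encoding="utf-8"?>'
STYLESHEET_PI = (
    '<?xml-stylesheet type="text/xsl" href="ResourceFiles/form_star.xsl"?>'
)

# encoding="unicode" returns a str without an XML declaration
body = ET.tostring(star_element, encoding="unicode")
document = "\n".join([XML_DECLARATION, STYLESHEET_PI, body]) + "\n"
print(document)
```

To save the result, encode the string with the same encoding that the declaration names, for example document.encode("utf-8"), and write the bytes to a file opened in binary mode.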

Read/Extract data from XML with Python

I am trying to read/extract data from XML with Python using xml.etree.ElementTree.
Unfortunately, I have not found out how to do it so far, most probably because I do not fully understand how XML works.
The idea is to collect the DocumentId numbers in a list.
Here is my XML file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<RegisterSearch TotalResults="4">
<SearchResults>
<Document DocumentId="1348828088501913376">
<DocumentNumber>001</DocumentNumber>
</Document>
<Document DocumentId="1348828088501881434">
<DocumentNumber>001</DocumentNumber>
</Document>
<Document DocumentId="1348828088539553420">
<DocumentNumber>010</DocumentNumber>
</Document>
<Document DocumentId="1348828088539570694">
<DocumentNumber>010</DocumentNumber>
</Document>
</SearchResults>
</RegisterSearch>
And here is my Python code:
#!/usr/bin/python2
import xml.etree.ElementTree as ET
tree = ET.parse('documents.xml')
root = tree.getroot()
for elem in root:
    if(elem.tag=='Document'):
        print elem.get('DocumentId')
This is what I try to achieve:
1348828088501913376
1348828088501881434
1348828088539553420
1348828088539570694
However, the code prints nothing.
Thanks in advance for your suggestions.
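Iterate over the tags you are interested in:
for elem in root.iter(tag='Document'):
    print(elem.get('DocumentId'))
Your original loop only visits the direct children of the root (here just SearchResults), so it never reaches the Document elements. It would work with
for elem in root.iter():
    ...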
v3.8: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
v2.7: https://docs.python.org/2.7/library/xml.etree.elementtree.html#finding-interesting-elements
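The difference between iterating over the root directly and using iter() can be seen in a small self-contained sketch. The inline XML string is a shortened copy of the question's file, and the variable names direct_children and document_ids are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<RegisterSearch TotalResults="4">
  <SearchResults>
    <Document DocumentId="1348828088501913376">
      <DocumentNumber>001</DocumentNumber>
    </Document>
    <Document DocumentId="1348828088501881434">
      <DocumentNumber>001</DocumentNumber>
    </Document>
  </SearchResults>
</RegisterSearch>"""

root = ET.fromstring(XML)

# Iterating over an element yields only its direct children
direct_children = [child.tag for child in root]

# iter() walks the whole subtree, optionally filtered by tag
document_ids = [doc.get("DocumentId") for doc in root.iter("Document")]

print(direct_children)
print(document_ids)
```

Collecting the values in a list comprehension, rather than printing inside the loop, gives the list the question asks for.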

Getting root node's attributes (namespace) in Python

I need to extract the namespaces declared at the very beginning of an XML file.
It looks something like this.
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="CannotGetThisAttrib" xmlns:b="CannotGetThisAttrib">
<fileHeader c="CanGetThisAttrib"/>
<body></body>
<fooder/>
</root>
I can extract attributes beneath the root node. However, I cannot get the root node's attributes a and b, which are the namespaces needed to parse the XML file.
tree = ET.parse("xmlfile.xml")
root = tree.getroot()
root.attrib => None
root[0].attrib["c"] => CanGetThisAttrib
Any advice is appreciated.
Here is a way, using lxml:
from lxml import etree
data = '''<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="CannotGetThisAttrib" xmlns:b="CannotGetThisAttrib">
<fileHeader c="CanGetThisAttrib"/>
<body></body>
<fooder/>
</root>
'''
data = data.encode('ascii')
tree = etree.fromstring(data)
for k,v in tree.nsmap.items():
    print('{} -> {}'.format(k,v))
Output:
a -> CannotGetThisAttrib
b -> CannotGetThisAttrib
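ElementTree does not expose xmlns declarations in attrib because it treats them as namespace declarations rather than ordinary attributes. If lxml is not available, the declarations can still be collected with the standard library by listening for "start-ns" events in iterparse. The sketch below assumes the document is held in memory as bytes; the namespaces dictionary is a name chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET

data = b"""<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="CannotGetThisAttrib" xmlns:b="CannotGetThisAttrib">
  <fileHeader c="CanGetThisAttrib"/>
  <body></body>
  <fooder/>
</root>
"""

# Each "start-ns" event carries a (prefix, uri) tuple
namespaces = {}
for _event, (prefix, uri) in ET.iterparse(io.BytesIO(data), events=("start-ns",)):
    namespaces[prefix] = uri

for prefix, uri in namespaces.items():
    print("{} -> {}".format(prefix, uri))
```

For a file on disk, the file name can be passed to iterparse in place of the BytesIO object. Unlike nsmap on the root element, this collects declarations from the whole document, not only from the root.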

Accessing Elements with and without namespaces using lxml

Is there a way to search, in a single pass, for an element that occurs both with and without a namespace in a document using lxml? As an example, I would like to get all occurrences of the element identifier, irrespective of whether it is associated with a specific namespace. I am currently only able to access them separately, as below.
Code:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
root = xmlfile.getroot()
for l in root.iter('identifier'):
    print l.text
for l in root.iter('{http://www.openarchives.org/OAI/2.0/provenance}identifier'):
    print l.text
File: xmlfile.xml
<?xml version="1.0"?>
<record>
<header>
<identifier>identifier1</identifier>
<datestamp>datastamp1</datestamp>
<setSpec>setspec1</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>title1</dc:title>
<dc:title>title2</dc:title>
<dc:creator>creator1</dc:creator>
<dc:subject>subject1</dc:subject>
<dc:subject>subject2</dc:subject>
</oai_dc:dc>
</metadata>
<about>
<provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription altered="false" harvestDate="2011-08-11T03:47:51Z">
<baseURL>baseURL1</baseURL>
<identifier>identifier3</identifier>
<datestamp>datestamp2</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
<originDescription altered="false" harvestDate="2010-10-10T06:15:53Z">
<baseURL>xxxxx</baseURL>
<identifier>identifier4</identifier>
<datestamp>2010-04-27T01:10:31Z</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
</originDescription>
</originDescription>
</provenance>
</about>
</record>
You could use XPath to solve that kind of issue:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
identifier_nodes = xmlfile.xpath("//*[local-name() = 'identifier']")
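The local-name() test compares only the part of the tag after the namespace. The same idea can be sketched with the standard library's ElementTree, which stores namespaced tags as "{uri}name". This is an illustration of the principle rather than the lxml solution above; the local_name helper is hypothetical, and the inline XML is a trimmed version of the question's file.

```python
import xml.etree.ElementTree as ET

XML = """<record>
  <header>
    <identifier>identifier1</identifier>
  </header>
  <about>
    <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
      <originDescription>
        <identifier>identifier3</identifier>
        <originDescription>
          <identifier>identifier4</identifier>
        </originDescription>
      </originDescription>
    </provenance>
  </about>
</record>"""


def local_name(tag):
    """Hypothetical helper: strip a leading '{namespace}' from a tag."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(XML)

# Walk every element and keep those whose local name matches
identifier_texts = [
    elem.text
    for elem in root.iter()
    if isinstance(elem.tag, str) and local_name(elem.tag) == "identifier"
]
print(identifier_texts)
```

The isinstance check skips comment and processing instruction nodes, whose tag is not a string.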

How to add an xml-stylesheet processing instruction node with Python 2.6 and minidom?

I'm creating an XML document using minidom - how do I ensure my resultant XML document contains a stylesheet reference like this:
<?xml-stylesheet type="text/xsl" href="mystyle.xslt"?>
Thanks !
Use something like this:
from xml.dom import minidom
xml = """
<root>
<x>text</x>
</root>"""
dom = minidom.parseString(xml)
pi = dom.createProcessingInstruction('xml-stylesheet',
                                     'type="text/xsl" href="mystyle.xslt"')
root = dom.firstChild
dom.insertBefore(pi, root)
print dom.toprettyxml()
=>
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="mystyle.xslt"?>
<root>
<x>
text
</x>
</root>
I am not familiar with minidom, but you must create a processing instruction (PI) node with the name "xml-stylesheet" and the text "type='text/xsl' href='mystyle.xslt'".
See the documentation for how a PI is created.
# Import the minidom submodule explicitly; "import xml.dom" alone does not load it
import xml.dom.minidom
dom = xml.dom.minidom.parse("C:\\Temp\\Report.xml")
pi = dom.createProcessingInstruction('xml-stylesheet',
                                     'type="text/xsl" href="TestCaseReport.xslt"')
root = dom.firstChild
dom.insertBefore(pi, root)
a = dom.toxml()
# The with block closes the file automatically
with open("C:\\Report(1).xml", 'w') as f:
    f.write(a)
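The examples above use dom.firstChild as the insertion point, which is the root element only when nothing precedes it. A slightly more defensive variant, sketched here for Python 3, anchors on dom.documentElement instead, so the instruction still lands directly before the root when the document starts with a comment. The sample XML and the variable names are assumptions made for illustration.

```python
from xml.dom import minidom

# Sample document with a comment before the root element
xml_text = """<!-- generated report -->
<root>
<x>text</x>
</root>"""

dom = minidom.parseString(xml_text)
pi = dom.createProcessingInstruction(
    "xml-stylesheet", 'type="text/xsl" href="mystyle.xslt"'
)

# documentElement is always the root element, whatever comes before it
dom.insertBefore(pi, dom.documentElement)

output = dom.toxml()
print(output)
```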
