I need to extract the namespaces declared at the very beginning of an XML file.
It looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="CannotGetThisAttrib" xmlns:b="CannotGetThisAttrib">
<fileHeader c="CanGetThisAttrib"/>
<body></body>
<fooder/>
</root>
I can extract attributes from the elements beneath the root node. However, I cannot get the root node's attributes a and b, which are the namespace declarations needed to parse the XML file.
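import xml.etree.ElementTree as ET
tree = ET.parse("xmlfile.xml")
root = tree.getroot()
root.attrib  # => {} (empty: the xmlns declarations are not exposed as attributes)
root[0].attrib["c"]  # => 'CanGetThisAttrib'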
Any advice is appreciated.
Here is a solution using lxml, which exposes the namespace declarations through the nsmap property:
from lxml import etree
data = '''<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="CannotGetThisAttrib" xmlns:b="CannotGetThisAttrib">
<fileHeader c="CanGetThisAttrib"/>
<body></body>
<fooder/>
</root>
'''
data = data.encode('ascii')
tree = etree.fromstring(data)
for k, v in tree.nsmap.items():
    print('{} -> {}'.format(k, v))
Output:
a -> CannotGetThisAttrib
b -> CannotGetThisAttrib
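If adding lxml is not an option, the standard library can report the same declarations. The following is a minimal sketch, assuming the question's document is available as bytes (here inlined rather than read from xmlfile.xml); it relies on the "start-ns" event of xml.etree.ElementTree.iterparse, which carries a (prefix, uri) pair for each namespace declaration.

```python
import io
import xml.etree.ElementTree as ET

# Inline copy of the question's document (assumption: normally read from a file).
data = b'''<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:a="CannotGetThisAttrib" xmlns:b="CannotGetThisAttrib">
<fileHeader c="CanGetThisAttrib"/>
<body></body>
<fooder/>
</root>
'''

# Ask iterparse for namespace-declaration events only.
# Each "start-ns" event carries a (prefix, uri) tuple.
namespaces = {}
for event, (prefix, uri) in ET.iterparse(io.BytesIO(data), events=("start-ns",)):
    namespaces[prefix] = uri

for prefix, uri in namespaces.items():
    print('{} -> {}'.format(prefix, uri))
```

The resulting dictionary has the same shape as lxml's nsmap, so it can be passed as the namespaces argument to find() and findall().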
I have the following XML docs:
doc_1: (Xincludes the second document)
<?xml version="1.0" encoding="utf-8"?>
<document xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="doc_2.xml" parse="xml" />
</document>
doc_2:
<?xml version="1.0" encoding="utf-8"?>
<para>This is a paragraph.</para>
My code looks as follows:
from lxml import etree
tree = etree.parse("stck_oflow_test.xml")  # load file
tree.xinclude()  # recursively includes files
root = tree.getroot()
def print_root():
    for child in root:
        print(child.tag, child.attrib, child.text)
print_root()
The output is what I want:
para {} This is a paragraph.
But here is the problem: if I change doc_2 into a file whose data has the form:
<?xml version="1.0" encoding="utf-8"?>
<para>
<length>177</length>
<weight>63</weight>
</para>
then the output no longer contains the contents of doc_2. Running the same Python code results in the output:
para {}
How can I fix this?
Thanks in advance
This is not a problem with XInclude; that part works.
The loop for child in root iterates only over the immediate children of root. In this case there is one child: the <para> element. Its text is just whitespace, because the actual values live in its own children.
To iterate over all descendants (including <length> and <weight>), use for child in root.iter() instead. Note that iter() also yields root itself. That results in the following output:
document {}
para {}
length {} 177
weight {} 63
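The difference between the two loops can be seen without any files. This is a small sketch using the standard library's ElementTree, where the string merged is an assumed stand-in for the tree that exists after tree.xinclude() has pulled doc_2 into doc_1.

```python
import xml.etree.ElementTree as ET

# Stand-in for the document after XInclude processing (assumption for illustration).
merged = """<document>
<para>
<length>177</length>
<weight>63</weight>
</para>
</document>"""
root = ET.fromstring(merged)

# Iterating over an element visits its immediate children only.
direct_children = [child.tag for child in root]

# iter() walks the element itself and every descendant, in document order.
all_descendants = [(elem.tag, (elem.text or "").strip()) for elem in root.iter()]

print(direct_children)
print(all_descendants)
```

The same iter() method is available on lxml elements, so the question's print_root() only needs its loop changed.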
I am trying to read and extract data from XML with Python using xml.etree.ElementTree.
Unfortunately, I have not yet found how to do it, most probably because I do not understand how XML works.
The idea is to collect the DocumentId values in a list.
Here is my XML file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<RegisterSearch TotalResults="4">
<SearchResults>
<Document DocumentId="1348828088501913376">
<DocumentNumber>001</DocumentNumber>
</Document>
<Document DocumentId="1348828088501881434">
<DocumentNumber>001</DocumentNumber>
</Document>
<Document DocumentId="1348828088539553420">
<DocumentNumber>010</DocumentNumber>
</Document>
<Document DocumentId="1348828088539570694">
<DocumentNumber>010</DocumentNumber>
</Document>
</SearchResults>
</RegisterSearch>
And here is my Python code:
#!/usr/bin/python2
import xml.etree.ElementTree as ET
tree = ET.parse('documents.xml')
root = tree.getroot()
for elem in root:
    if elem.tag == 'Document':
        print elem.get('DocumentId')
This is what I am trying to achieve:
1348828088501913376
1348828088501881434
1348828088539553420
1348828088539570694
However, the code prints nothing.
Thanks in advance for your suggestions.
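Iterate over the tags you are interested in:
for elem in root.iter(tag='Document'):
    print(elem.get('DocumentId'))
Your original loop only visits the direct children of the root (here, the single SearchResults element), so it never reaches the Document elements. It would work if it iterated over all descendants instead:
for elem in root.iter():
    ...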
v3.8: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
v2.7: https://docs.python.org/2.7/library/xml.etree.elementtree.html#finding-interesting-elements
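Since the stated goal is a list rather than printed lines, the same iteration fits in a list comprehension. A minimal sketch, assuming the question's XML is inlined as a string instead of being read from documents.xml:

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's data (assumption: normally parsed from a file).
xml_text = """<RegisterSearch TotalResults="4">
<SearchResults>
<Document DocumentId="1348828088501913376">
<DocumentNumber>001</DocumentNumber>
</Document>
<Document DocumentId="1348828088501881434">
<DocumentNumber>001</DocumentNumber>
</Document>
<Document DocumentId="1348828088539553420">
<DocumentNumber>010</DocumentNumber>
</Document>
<Document DocumentId="1348828088539570694">
<DocumentNumber>010</DocumentNumber>
</Document>
</SearchResults>
</RegisterSearch>"""
root = ET.fromstring(xml_text)

# iter('Document') finds matching elements at any depth below the root.
document_ids = [doc.get("DocumentId") for doc in root.iter("Document")]
print(document_ids)
```

The values stay strings here; wrap doc.get(...) in int() if numbers are needed.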
I want to keep the xml-stylesheet processing instruction when I rewrite an XML file, but it gets dropped.
I use Python to modify the XML in order to deploy Hadoop automatically.
XML:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://c11:9000</value>
</property>
</configuration>
Code:
from xml.etree.ElementTree import ElementTree as ET
def modify_core_site(namenode_hostname):
    tree = ET()
    tree.parse("pkg/core-site.xml")
    root = tree.getroot()
    for p in root.iter("property"):
        name = p.find("name").text
        if name == "fs.default.name":
            text = "hdfs://%s:9000" % namenode_hostname
            p.find("value").text = text
    tree.write("pkg/tmp.xml", encoding="utf-8", xml_declaration=True)
modify_core_site("c80")
Result:
<?xml version='1.0' encoding='utf-8'?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://c80:9000</value>
</property>
</configuration>
The xml-stylesheet line disappears.
How can I keep it?
One solution is to use lxml. Once you have parsed the XML, walk back from the root element until you find the xml-stylesheet node. A quick sample:
>>> import lxml.etree
>>> doc = lxml.etree.parse('C:/downloads/xmltest.xml')
>>> root = doc.getroot()
>>> xslnode=root.getprevious().getprevious()
>>> xslnode
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
Make sure you add some exception handling and check that the node actually exists. You can check whether the node is an XSLT processing instruction with:
>>> isinstance(xslnode, lxml.etree._XSLTProcessingInstruction)
True
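As an alternative that avoids the extra dependency, the standard library's xml.dom.minidom keeps processing instructions and comments that precede the root element and writes them back out on serialization. This is a different technique from the lxml approach above, shown only as a sketch: the file content is inlined as a string, and set_namenode is a hypothetical helper name.

```python
from xml.dom import minidom

# Inline copy of the question's core-site.xml (assumption: normally read from disk).
CORE_SITE = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://c11:9000</value>
</property>
</configuration>
"""

def set_namenode(xml_text, namenode_hostname):
    """Hypothetical helper: rewrite fs.default.name and leave the rest alone."""
    dom = minidom.parseString(xml_text)
    for prop in dom.getElementsByTagName("property"):
        name = prop.getElementsByTagName("name")[0].firstChild.data
        if name == "fs.default.name":
            value = prop.getElementsByTagName("value")[0]
            # The element's first child is the text node holding the URL.
            value.firstChild.data = "hdfs://%s:9000" % namenode_hostname
    # The DOM still holds the stylesheet PI and the comment as document children.
    return dom.toxml()

result = set_namenode(CORE_SITE, "c80")
print(result)
```

To write the result to a file, open it in text mode and write the returned string, as the question's tree.write() call did for pkg/tmp.xml.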
I have a very simple KML file which returns no nodes when parsed with ElementTree. This is frustrating me :-). Any clues?
from xml.etree import ElementTree
from pprint import pprint
kml = '''<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Document>
<name>NEXRAD Radar Sites</name>
<Schema parent="Placemark" name="wsr">
<SimpleField type="wstring" name="STATE">
</SimpleField>
</Schema>
<wsr>
<name>KABR</name>
</wsr>
</Document>
</kml>
'''
tree = ElementTree.fromstring(kml)
ElementTree.dump(tree)
for node in tree.iter('wsr'):
    pprint(node)
for node in tree.findall('../wsr'):
    pprint(node)
The tags are namespaced. If you call tree.iter() with no tag, it will show what ElementTree thinks the tags are called. Because of the default xmlns declaration on <kml>, the wsr tag is actually called {http://earth.google.com/kml/2.0}wsr. This returns a node:
list(tree.iter('{http://earth.google.com/kml/2.0}wsr'))
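Writing the full {uri}tag form everywhere gets tedious, so find() and findall() also accept a prefix-to-URI mapping. A minimal sketch with a trimmed copy of the question's KML; the prefix "kml" is an arbitrary alias chosen here, not something the document defines.

```python
from xml.etree import ElementTree

# Trimmed copy of the question's KML (XML declaration and Schema omitted).
kml = '''<kml xmlns="http://earth.google.com/kml/2.0">
<Document>
<name>NEXRAD Radar Sites</name>
<wsr>
<name>KABR</name>
</wsr>
</Document>
</kml>
'''
tree = ElementTree.fromstring(kml)

# Listing the tags shows the namespace-qualified names ElementTree uses.
all_tags = [elem.tag for elem in tree.iter()]
print(all_tags[0])

# Map a local alias to the namespace URI and use it in path expressions.
ns = {"kml": "http://earth.google.com/kml/2.0"}
sites = tree.findall(".//kml:wsr", ns)
names = [site.find("kml:name", ns).text for site in sites]
print(names)
```

The unqualified search tree.iter('wsr') still matches nothing, which is the behaviour the question ran into.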
I'm creating an XML document using minidom. How do I ensure that the resulting XML document contains a stylesheet reference like this:
<?xml-stylesheet type="text/xsl" href="mystyle.xslt"?>
Thanks!
Use something like this:
from xml.dom import minidom
xml = """
<root>
<x>text</x>
</root>"""
dom = minidom.parseString(xml)
pi = dom.createProcessingInstruction('xml-stylesheet',
                                     'type="text/xsl" href="mystyle.xslt"')
root = dom.firstChild
dom.insertBefore(pi, root)
print(dom.toprettyxml())
=>
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="mystyle.xslt"?>
<root>
<x>
text
</x>
</root>
I am not familiar with minidom, but you must create a processing instruction (PI) node with the target "xml-stylesheet" and the data "type='text/xsl' href='mystyle.xslt'".
See the documentation for how a PI is created.
import xml.dom.minidom  # "import xml.dom" alone does not load the minidom submodule
dom = xml.dom.minidom.parse("C:\\Temp\\Report.xml")
pi = dom.createProcessingInstruction('xml-stylesheet',
                                     'type="text/xsl" href="TestCaseReport.xslt"')
root = dom.firstChild
dom.insertBefore(pi, root)
a = dom.toxml()
with open("C:\\Report(1).xml", 'w') as f:
    f.write(a)
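Both answers start from an existing document. When the document is built from scratch with minidom, as the question describes, the processing instruction can simply be appended before the root element. A minimal sketch, reusing the element names root and x from the first answer as placeholders:

```python
from xml.dom import minidom

doc = minidom.Document()

# Append the PI first so it is serialized ahead of the root element.
pi = doc.createProcessingInstruction(
    "xml-stylesheet", 'type="text/xsl" href="mystyle.xslt"')
doc.appendChild(pi)

# Placeholder content; a real document would build its own tree here.
root = doc.createElement("root")
doc.appendChild(root)
x = doc.createElement("x")
x.appendChild(doc.createTextNode("text"))
root.appendChild(x)

xml_text = doc.toxml()
print(xml_text)
```

Appending in this order avoids the later insertBefore() call, since document children are serialized in the order they were added.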