ElementTree appends extra information when getting root element - python

I have an XML file like the one shown below:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dtbook PUBLIC "-//INFO//INFO info 2005-3//EN" "http://url">
<dtbook xmlns="http://www.daisy.org/z3986/2005/dtbook/" version="2005-3" xml:lang="ml">
<head>....
</dtbook>
I open the file like so:
import xml.etree.ElementTree as ET
with open("filename.xml") as f:
    tree = ET.parse(f)
root = tree.getroot()
When I try to get the root tag, I get,
print(root.tag)
{http://www.daisy.org/z3986/2005/dtbook/}dtbook
whereas if I remove all the attributes from the root tag i.e. dtbook, I get the correct output i.e. dtbook
print(root.tag)
dtbook
I cannot remove the attributes. Is there a way to get this working without removing the attributes??

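The part in curly braces is the XML namespace, which comes from the xmlns attribute on the root element. ElementTree reports every tag in that namespace as {namespace-URI}localname, so this output is expected. To get the bare tag name, split the string at the closing brace } and keep the last part.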
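A minimal sketch of that split, assuming the document from the question is reduced to an inline string (the DOCTYPE is left out for brevity) and using a hypothetical helper name, local_name:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the file in the question (assumption: DOCTYPE omitted).
SAMPLE = (
    '<dtbook xmlns="http://www.daisy.org/z3986/2005/dtbook/" '
    'version="2005-3" xml:lang="ml"><head/></dtbook>'
)

def local_name(tag):
    """Return the tag without its leading '{namespace-uri}' part, if any."""
    return tag.rsplit("}", 1)[-1]

root = ET.fromstring(SAMPLE)
print(root.tag)              # namespaced form, as ElementTree stores it
print(local_name(root.tag))  # bare tag name
```

The attributes stay on the element, so nothing has to be removed from the file. Tags without a namespace pass through local_name unchanged.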

Related

Referencing the node of a .props file in python

Sorry if this is a silly question, but I've been trying to reference the directory in a .props file so that I can parse it and use it as a variable in a Python project. The problem is that the header of the file is being treated as the root of the document, and I haven't been able to reference the MainRoot node no matter what I've done. I've tried 'json.dumps()', 'root.iter()', 'root.findall()', and a slew of other options to get past the header, but every time I try, the result is either an error or nothing at all.
I'm guessing it's because I'm using a props file and while similar, these solutions are supposed to be for .xml files, but I haven't found anything that implies I should be dealing with .props files any differently.
In short, how can I take the information in the MainRoot node below and, in a separate Python program, parse it and store it in a variable? The props file is below.
<!--YouFoundMe.props-->
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="14.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PropertyGroup>
<MainRoot>..\..\..\YouFoundMe</MainRoot>
</PropertyGroup>
</Project>
This may not be important, but if it helps, I'll also post a python file containing some of the failed methods I tried below:
import xml.etree.ElementTree as ET
import json
tree = ET.parse(r'YouFoundMe\YouFoundMe.props')  # raw string keeps the backslash literal
root = tree.getroot()
FIND_ME_DIR = json.dumps(root.attrib)
boobop = json.dumps(root.tag)
print(FIND_ME_DIR)
for child in root:
    print(child.tag)
    print(child.attrib)
for MainRoot in root.iter('Project'):
    print(MainRoot.attrib)
for MainRoot in root.iter('PropertyGroup'):
    print(MainRoot.attrib)
for MainRoot in root.iter('MainRoot'):
    print(MainRoot.attrib)
for child in root.iter('MainRoot'):
    print("Aything? Please?")
for PROP in root.findall('PropertyGroup'):
    result = PROP.find('MainRoot').text
    print(result)
for MainRoot in root.findall('Project'):
    print("Text")
for MainRoot in root.findall('PropertyGroup'):
    print("Text")
for MainRoot in root.findall('MainRoot'):
    print("Text")
element = root.find('Project')
if not element: # careful!
    print("element not found, or element has no subelements")
if element is None:
    print("element not found")
test = str(root.get("Project"))
print(test)
test = str(root.get("PropertyGroup"))
print(test)
test = str(root.get("MainRoot"))
print(test)
print(tree)
print(root)
Notice that your XML has a default namespace declared at the root element level:
xmlns="http://schemas.microsoft.com/developer/msbuild/2003"
Note that descendant elements without prefix, including MainRoot, inherit this default namespace implicitly. You can define a prefix that references the above default namespace and then use that prefix to find MainRoot, for example:
ns = { 'd': 'http://schemas.microsoft.com/developer/msbuild/2003' }
main_root = root.find('.//d:MainRoot', namespaces=ns)
print(main_root.text)
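Put together as something that runs on its own, the lookup might look like the sketch below. It assumes the .props content is inlined as a string (without the leading comment and XML declaration) instead of being read from disk, and it reuses the FIND_ME_DIR name from the question:

```python
import xml.etree.ElementTree as ET

# Inline copy of the .props content (assumption: normally read with ET.parse).
PROPS = r"""<Project ToolsVersion="14.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <MainRoot>..\..\..\YouFoundMe</MainRoot>
  </PropertyGroup>
</Project>"""

# Map an arbitrary prefix ('d') to the document's default namespace.
ns = {"d": "http://schemas.microsoft.com/developer/msbuild/2003"}

root = ET.fromstring(PROPS)
main_root = root.find(".//d:MainRoot", namespaces=ns)

# find() returns None when nothing matches, so guard before reading .text.
FIND_ME_DIR = main_root.text if main_root is not None else None
print(FIND_ME_DIR)
```

The un-prefixed searches in the question, such as root.iter('MainRoot'), find nothing because the element's real tag includes the namespace URI in braces.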

Generate XML Document in Python 3 using Namespaces and ElementTree

I am having problems generating an XML document using the ElementTree framework in Python 3. I tried registering the namespace before setting up the document. Right now it seems that I can generate an XML document only by adding the namespace to each element, like a=Element("{full_namespace_URI}element_name"), which seems tedious.
How do I set up the default namespace so that I can omit it from each element?
Any help is appreciated.
I have written a small demo program for Python 3:
from io import BytesIO
from xml.etree import ElementTree as ET
ET.register_namespace("", "urn:dslforum-org:service-1-0")
"""
desired output
==============
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0">
<childNode>content</childNode>
</topNode>
"""
# build XML document without namespaces
a = ET.Element("topNode")
b = ET.Element("childNode")
b.text = "content"
a.append(b)
tree = ET.ElementTree(a)
# build XML document with namespaces
a_ns = ET.Element("{dsl}topNode")
b_ns = ET.Element("{dsl}childNode")
b_ns.text = "content"
a_ns.append(b_ns)
tree_ns = ET.ElementTree(a_ns)
def print_element_tree(element_tree, comment, default_namespace=None):
    """
    print element tree with comment to standard out
    """
    with BytesIO() as buf:
        element_tree.write(buf, encoding="utf-8", xml_declaration=True,
                           default_namespace=default_namespace)
        buf.seek(0)
        print(comment)
        print(buf.read().decode("utf-8"))
print_element_tree(tree, "Element Tree without XML namespace")
print_element_tree(tree_ns, "Element Tree with XML namespace", "dsl")
I believe you are overthinking this.
Registering a default namespace in your code avoids the ns0: aliases.
Registering any namespaces you will use while creating a document allows you to designate the alias used for each namespace.
To achieve your desired output, assign the namespace to your top element:
a = ET.Element("{urn:dslforum-org:service-1-0}topNode")
The preceding ET.register_namespace("", "urn:dslforum-org:service-1-0") will make that the default namespace in the document, assign it to topNode, and not prefix your tag names.
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0"><childNode>content</childNode></topNode>
If you remove the register_namespace() call, then you get this monstrosity:
<?xml version='1.0' encoding='utf-8'?>
<ns0:topNode xmlns:ns0="urn:dslforum-org:service-1-0"><childNode>content</childNode></ns0:topNode>
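As a rough sketch of one way to cut down the repetition the question complains about, a small helper can build the qualified names. The helper name qname is hypothetical, and this variant also places childNode in the namespace, which is an assumption about what the desired output means:

```python
import xml.etree.ElementTree as ET

NS = "urn:dslforum-org:service-1-0"

# Register the URI with an empty prefix so it is written as xmlns="...".
# Note: register_namespace() is global to the ElementTree module.
ET.register_namespace("", NS)

def qname(local_name):
    """Build the '{uri}name' form that ElementTree uses for qualified tags."""
    return f"{{{NS}}}{local_name}"

top = ET.Element(qname("topNode"))
child = ET.SubElement(top, qname("childNode"))
child.text = "content"

xml_text = ET.tostring(top, encoding="unicode")
print(xml_text)
```

Because both elements carry the same namespace, parsing the output again yields the same qualified tags that were put in.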

Parse attributes of root node having namespace in XML

I have the following XML and want to read an attribute of the root node, namely the value of xmlns:n1, but the code below raises a KeyError.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Level-1C_Tile_ID xmlns:n1="https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd /dpc/app/s2ipf/FORMAT_METADATA_TILE_L1C/02.10.02/scripts/../../../schemas/02.12.05/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd">
<n1:General_Info>
<TILE_ID metadataLevel="Brief">S2A_OPER_MSI_L1C_TL_MTI__20161111T091803_A007252_T43QBA_N02.04</TILE_ID>
<DATASTRIP_ID metadataLevel="Standard">S2A_OPER_MSI_L1C_DS_MTI__20161111T091803_S20161111T053350_N02.04</DATASTRIP_ID>
<DOWNLINK_PRIORITY metadataLevel="Standard">NOMINAL</DOWNLINK_PRIORITY>
<SENSING_TIME metadataLevel="Standard">2016-11-11T05:33:50.068Z</SENSING_TIME>
<Archiving_Info metadataLevel="Expertise">
<ARCHIVING_CENTRE>MTI_</ARCHIVING_CENTRE>
<ARCHIVING_TIME>2016-11-11T10:53:22.600451Z</ARCHIVING_TIME>
</Archiving_Info>
</n1:General_Info>
</n1:Level-1C_Tile_ID>
Code :
from lxml import etree
tree = etree.parse(XML_FILE_CONTENT_PASTED_ABOVE)
root = tree.getroot()
print(root.attrib['xmlns:n1'])
Error :
KeyError: 'xmlns:n1'
Desired output :
https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd
A namespace declaration (xmlns:n1='...') looks like an attribute, but it is not part of the attrib dictionary of an element.
To get the namespace URI associated with the n1 prefix, use nsmap:
print(root.nsmap["n1"])
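nsmap is an lxml feature. If only the standard library is available, a comparable mapping can be collected with xml.etree.ElementTree.iterparse and its "start-ns" events. This is a swapped-in technique, not what the answer above uses, and the sample below is a shortened stand-in for the file in the question:

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the metadata file (assumption: most content omitted).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Level-1C_Tile_ID xmlns:n1="https://psd-12.sentinel2.eo.esa.int/PSD/S2_PDI_Level-1C_Tile_Metadata.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <n1:General_Info/>
</n1:Level-1C_Tile_ID>"""

# Each "start-ns" event carries a (prefix, uri) pair for one declaration.
nsmap = {}
for _event, (prefix, uri) in ET.iterparse(io.BytesIO(SAMPLE), events=("start-ns",)):
    nsmap.setdefault(prefix, uri)  # keep the first declaration of each prefix

print(nsmap["n1"])
```

With a real file, the path can be passed to iterparse in place of the BytesIO object.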

I want to update value of a particular xml tag using python code?

My xml file looks like below :-
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Messages xmlns="URL/sampleMessages-v1">
<Header>
<TransactionId>0</TransactionId>
<RequestNo>41194812</RequestNo>
<VNo>6789</VNo>
<Source></Source>
</Header>
...
...
</Messages>
I want to read it and change the RequestNo value
<RequestNo>41194812</RequestNo> to
<RequestNo>41194000</RequestNo>
I am currently using the ElementTree module on a Windows machine.
I want to update the value in the same file.
I have tried the code below:
for elem in root:
    for subelem in elem:
        #print (subelem.tag)
        if 'RequestNo' in subelem.tag :
            #print (subelem.text)
            subelem.text="41194813"
But I am not able to see the change, and I don't know how to write the new value subelem.text="41194813" into the existing XML file.
Your for loop does the job: it did replace the text correctly. The change is in your root variable. You can verify that by adding the following line right after the for loop:
ElementTree.dump(root)
Now that you have the XML updated, you will need to write that into a file:
tree.write('newfile.xml')
Where tree is the result of ElementTree.parse(). So, to put everything together:
from xml.etree import ElementTree

tree = ElementTree.parse('messages.xml')
root = tree.getroot()
for elem in root:
    for subelem in elem:
        if 'RequestNo' in subelem.tag:
            subelem.text = '41194813'
            break  # stop scanning this element once RequestNo is updated
tree.write('messages-new.xml')
Dealing with Namespaces
Your XML document contains namespaces, so if you plan to search for a tag, you need to include the namespaces in the tag names. Here is an alternative solution which deals with namespaces:
tree = ElementTree.parse('messages.xml')
root = tree.getroot()
namespaces = {'xxx': 'URL/sampleMessages-v1'}
node = root.find('xxx:Header/xxx:RequestNo', namespaces)
if node is not None:
    node.text = '41194813'
    tree.write('messages-new.xml')
In the example above, I gave your namespace the prefix 'xxx'. It can be anything ('foo', 'bar', ...), as long as the same prefix is used in the call to root.find().
Removing "ns0" from Output File
In order to remove "ns0" from output file, you need to register the namespace before writing:
ElementTree.register_namespace('', 'URL/sampleMessages-v1')
tree.write('messages-new.xml')
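The pieces above can be combined into one small sketch. It works on an inline string rather than a file so that it runs as-is; the sample is a shortened version of the XML in the question, and the prefix 'm' is arbitrary:

```python
import xml.etree.ElementTree as ET

NS_URI = "URL/sampleMessages-v1"

# Shortened stand-in for messages.xml (assumption: remaining elements omitted).
SAMPLE = """<Messages xmlns="URL/sampleMessages-v1">
  <Header>
    <TransactionId>0</TransactionId>
    <RequestNo>41194812</RequestNo>
  </Header>
</Messages>"""

# Register the default namespace first so the output has no ns0: prefixes.
ET.register_namespace("", NS_URI)

root = ET.fromstring(SAMPLE)
node = root.find("m:Header/m:RequestNo", {"m": NS_URI})
if node is not None:
    node.text = "41194000"

updated = ET.tostring(root, encoding="unicode")
print(updated)
```

When working with a file, calling tree.write() with the original file name instead of 'messages-new.xml' overwrites the file in place, which is what the question asks for. Passing encoding="UTF-8" and xml_declaration=True to tree.write() keeps an XML declaration in the output.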

Programmatically clean/ignore namespaces in XML - python

I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.
The XML looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
{...}
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
<cmdty:space>ISO4217</cmdty:space>
<cmdty:id>BRL</cmdty:id>
<cmdty:get_quotes/>
<cmdty:quote_source>currency</cmdty:quote_source>
<cmdty:quote_tz/>
</gnc:commodity>
Right now, I'm able to iterate and get results using
import xml.etree.ElementTree as ET
r = ET.parse("file.xml").findall('.//')
after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.
Note that I'm a complete noob in python, and I've read: Python and GnuCash: Extract data from GnuCash files, Cleaning an XML file in Python before parsing and python: xml.etree.ElementTree, removing "namespaces" along with ElementTree docs and I'm still lost...
I've come up with this solution:
def strip_namespaces(self, tree):
    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)
    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)
        end = re.sub(nspOpen, '<\/', tree.tag)
        # pprint(finaltree)
    return
But I'm failing to apply it. I can't seem to be able to retrieve the tag names as they appear in the file.
I think the Python code below will be helpful to you.
sample.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:change>
<gnc:lastUpdate>2018-12-21
</gnc:lastUpdate>
</gnc:change>
<gnc:bill>
<gnc:billAccountNumber>1234</gnc:billAccountNumber>
<gnc:roles>
<gnc:id>111111</gnc:id>
<gnc:pos>2</gnc:pos>
<gnc:genid>15</gnc:genid>
</gnc:roles>
</gnc:bill>
<gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>
Python code to remove the xmlns namespace from the root tag:
import xml.etree.ElementTree as ET

def xmlns(tag):
    # Drop every "{namespace-uri}" part from the tag string
    parts = []
    for piece in tag.split('{'):
        if '}' in piece:
            parts.append(piece.split('}')[1])
        else:
            parts.append(piece)
    return ''.join(parts)

tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)         # root tag with the namespace URI as prefix
print(xmlns(root.tag))  # root tag without the namespace URI
Output:
{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
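The same idea can be applied to every element instead of only the root, which is closer to what the question asks for. The following is a minimal sketch, assuming a small made-up sample with two of the GnuCash prefixes; it rewrites element tags only and leaves attribute names untouched:

```python
import xml.etree.ElementTree as ET

# Small made-up sample (assumption: not a real GnuCash file).
SAMPLE = """<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
              xmlns:book="http://www.gnucash.org/XML/book">
  <gnc:bill>
    <book:id type="guid">1234</book:id>
  </gnc:bill>
</gnc:prodinfo>"""

def strip_namespaces(root):
    """Rewrite every element tag in place, dropping the '{uri}' part."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
    return root

root = strip_namespaces(ET.fromstring(SAMPLE))
tags = [elem.tag for elem in root.iter()]
print(tags)
print(root.find("bill/id").text)
```

After stripping, plain paths such as "bill/id" work without a namespace mapping. Elements from different namespaces that share a local name become indistinguishable, so this only suits documents where that does not matter. On Python 3.8 and later, a search pattern like ".//{*}id" matches a tag in any namespace without modifying the tree.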
