Overwriting an XML file and one of my namespaces is missing

I am parsing an XML file, replacing a value in it, and overwriting it. Everything works, but one of the two namespace declarations on the root element is missing after the overwrite.
I read that I have to register my namespaces. I did, but it changes nothing.
Here is the input XML file:
<?xml version="1.0" encoding ="utf8"?>
<Document xmlns:xsi = "sample" xmlns ="sample2">
and here is the output:
<?xml version='1.0' encoding='UTF-8'?>
<Document xmlns="sample2">
This is where I register my namespaces:
ET.register_namespace('xsi' , "sample")
ET.register_namespace('' , "sample2" )
and this is the write call:
tree.write(path , xml_declaration=True, method='xml', encoding='UTF-8')
Do you have any idea what the problem is and how I can fix it?

It would probably be easier with the lxml library:
from lxml import etree
nsmap = {'xsi': "sample", None: "sample2"}
root = etree.Element('Document', nsmap=nsmap)
print(etree.tostring(root).decode())
which gives the desired output:
<Document xmlns:xsi="sample" xmlns="sample2"/>
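If you need to stay with the standard library, note that ElementTree only writes declarations for namespaces that an element or attribute actually uses, which would explain why the unused xsi declaration disappears. Below is a minimal sketch of one common workaround: re-add the declaration as a plain attribute before writing. It assumes the document looks like the sample in the question; the Item child is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for the file from the question; <Item> is a hypothetical child.
SOURCE = '<Document xmlns:xsi="sample" xmlns="sample2"><Item>1</Item></Document>'

ET.register_namespace("xsi", "sample")
ET.register_namespace("", "sample2")

root = ET.fromstring(SOURCE)

# Nothing in the tree uses the "sample" namespace, so its declaration is dropped.
plain = ET.tostring(root, encoding="unicode")

# Workaround: write the declaration back as an ordinary attribute.
root.set("xmlns:xsi", "sample")
patched = ET.tostring(root, encoding="unicode")

print(plain)
print(patched)
```

Only do this when the namespace is otherwise unused. If some element or attribute does use it, ElementTree emits its own declaration and the manual one would be a duplicate.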

Related

Generate XML Document in Python 3 using Namespaces and ElementTree

I am having trouble generating an XML document with the ElementTree framework in Python 3. I tried registering the namespace before setting up the document. Right now it seems that I can generate a namespaced XML document only by adding the namespace to each element, like a = Element("{full_namespace_URI}element_name"), which seems tedious.
How do I set up the default namespace so that I can omit it from each element?
Any help is appreciated.
I have written a small demo program for Python 3:
from io import BytesIO
from xml.etree import ElementTree as ET
ET.register_namespace("", "urn:dslforum-org:service-1-0")
"""
desired output
==============
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0">
<childNode>content</childNode>
</topNode>
"""
# build XML document without namespaces
a = ET.Element("topNode")
b = ET.Element("childNode")
b.text = "content"
a.append(b)
tree = ET.ElementTree(a)
# build XML document with namespaces
a_ns = ET.Element("{dsl}topNode")
b_ns = ET.Element("{dsl}childNode")
b_ns.text = "content"
a_ns.append(b_ns)
tree_ns = ET.ElementTree(a_ns)
def print_element_tree(element_tree, comment, default_namespace=None):
    """
    print element tree with comment to standard out
    """
    with BytesIO() as buf:
        element_tree.write(buf, encoding="utf-8", xml_declaration=True,
                           default_namespace=default_namespace)
        buf.seek(0)
        print(comment)
        print(buf.read().decode("utf-8"))
print_element_tree(tree, "Element Tree without XML namespace")
print_element_tree(tree_ns, "Element Tree with XML namespace", "dsl")

I believe you are overthinking this.
Registering a default namespace in your code avoids the ns0: aliases.
Registering any namespaces you will use while creating a document allows you to designate the alias used for each namespace.
To achieve your desired output, assign the namespace to your top element:
a = ET.Element("{urn:dslforum-org:service-1-0}topNode")
The preceding ET.register_namespace("", "urn:dslforum-org:service-1-0") will make that the default namespace in the document, assign it to topNode, and not prefix your tag names.
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0"><childNode>content</childNode></topNode>
If you remove the register_namespace() call, then you get this monstrosity:
<?xml version='1.0' encoding='utf-8'?>
<ns0:topNode xmlns:ns0="urn:dslforum-org:service-1-0"><childNode>content</childNode></ns0:topNode>
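To cut down on the tedium of spelling out the URI for every element, a small helper can build the qualified names. This is only a sketch: the helper name `qualified` is made up, and it qualifies the child element as well as the root so that both really belong to the namespace.

```python
import xml.etree.ElementTree as ET

NS = "urn:dslforum-org:service-1-0"
ET.register_namespace("", NS)  # serialize this namespace without a prefix


def qualified(tag):
    """Return the tag in ElementTree's '{namespace}tag' notation."""
    return f"{{{NS}}}{tag}"


top = ET.Element(qualified("topNode"))
child = ET.SubElement(top, qualified("childNode"))
child.text = "content"

xml_text = ET.tostring(top, encoding="unicode")
print(xml_text)
```

With the registration in place, both elements are written without a prefix and the root carries a single xmlns declaration.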

Reading XML with the lxml lib: getting a strange string from the xmlns tag

I am writing a program that reads an XML file and modifies it. But when I try to access any element, I get an extra string in front of the tag name.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that URL, and how do I stop it from showing up?

Your document defines a default namespace (Wikipedia, lxml tutorial). Once it is defined, it becomes part of the tag of the element that declares it and of every unprefixed descendant.
If you want to print the tag without the namespace, it's easy:
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
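The same "{namespace}tag" notation is used by the standard library's ElementTree, so the idea can be sketched without lxml. The helper name `local_name` and the `sf` prefix are made up for this example; the XML is a shortened copy of the file from the question.

```python
import xml.etree.ElementTree as ET

NS = "http://soap.sforce.com/2006/04/metadata"
PACKAGE = f"""<Package xmlns="{NS}">
  <types>
    <members>sbaa__ApprovalChain__c.ExternalID__c</members>
    <name>CustomField</name>
  </types>
  <version>40.0</version>
</Package>"""


def local_name(tag):
    """Return the tag without its leading '{namespace}' part, if any."""
    return tag.rpartition("}")[2]


root = ET.fromstring(PACKAGE)
first = root[0][0]
print(first.tag)              # full name, including the namespace URI
print(local_name(first.tag))  # just the local part

# Searching needs the namespace too; map a prefix of your choice to the URI.
members = [m.text for m in root.findall("sf:types/sf:members", {"sf": NS})]
print(members)
```

Stripping the namespace is fine for display, but lookups such as find() and findall() still need the qualified name or a prefix mapping like the one above.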

Cannot write XML file with default namespace [duplicate]

I'm writing a Python script to update Visual Studio project files. They look like this:
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" DefaultTargets="Build"
xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PropertyGroup>
...
The following code reads and then writes the file:
import xml.etree.ElementTree as ET
tree = ET.parse(projectFile)
root = tree.getroot()
tree.write(projectFile,
           xml_declaration = True,
           encoding = 'utf-8',
           method = 'xml',
           default_namespace = "http://schemas.microsoft.com/developer/msbuild/2003")
Python throws an error at the last line, saying:
ValueError: cannot use non-qualified names with default_namespace option
This is surprising since I'm just reading and writing, with no editing in between. Visual Studio refuses to load XML files without a default namespace, so omitting it is not an option.
Why does this error occur? Suggestions or alternatives welcome.

This is a duplicate of Saving XML files using ElementTree.
The solution is to define your default namespace BEFORE parsing the project file.
ET.register_namespace('',"http://schemas.microsoft.com/developer/msbuild/2003")
Then write out your file as
tree.write(projectFile,
           xml_declaration = True,
           encoding = 'utf-8',
           method = 'xml')
You have now round-tripped your file successfully and avoided the creation of ns0 prefixes everywhere.
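The difference between the two approaches can be seen in a small self-contained sketch. The project XML here is a cut-down stand-in for a real project file. As far as I can tell, the ValueError comes from the unqualified ToolsVersion attribute, and the registration only has to happen before serializing, although registering before parsing is a safe habit.

```python
import io
import xml.etree.ElementTree as ET

MSBUILD_NS = "http://schemas.microsoft.com/developer/msbuild/2003"
# Cut-down stand-in for a Visual Studio project file.
PROJECT = (
    f'<Project ToolsVersion="4.0" xmlns="{MSBUILD_NS}">'
    "<PropertyGroup /></Project>"
)

root = ET.fromstring(PROJECT)

# Without a registered prefix, ElementTree invents one (ns0).
before = ET.tostring(root, encoding="unicode")

# Passing default_namespace= raises, because the tree contains an
# unqualified name (the ToolsVersion attribute).
try:
    ET.ElementTree(root).write(io.BytesIO(), default_namespace=MSBUILD_NS)
    error = None
except ValueError as exc:
    error = str(exc)

# Registering the empty prefix makes the serializer use xmlns="...".
ET.register_namespace("", MSBUILD_NS)
after = ET.tostring(root, encoding="unicode")

print(before)
print(error)
print(after)
```

Note that register_namespace() changes module-wide state, so it affects every tree serialized afterwards in the same process.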

I think that lxml does a better job of handling namespaces. It aims for an ElementTree-like interface but uses libxml2 underneath.
>>> import lxml.etree
>>> doc=lxml.etree.fromstring("""<?xml version="1.0" encoding="utf-8"?>
... <Project ToolsVersion="4.0" DefaultTargets="Build"
... xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
... <PropertyGroup>
... </PropertyGroup>
... </Project>""")
>>> print(lxml.etree.tostring(doc, xml_declaration=True, encoding='utf-8', method='xml', pretty_print=True).decode('utf-8'))
<?xml version='1.0' encoding='utf-8'?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" ToolsVersion="4.0" DefaultTargets="Build">
<PropertyGroup>
</PropertyGroup>
</Project>

This was the closest answer I could find to my problem. Putting
ET.register_namespace('',"http://schemas.microsoft.com/developer/msbuild/2003")
just before parsing my file did not work.
You need to find the specific namespace that the XML file you are loading uses. To do that, I printed the tag of the tree's root Element, which showed both the namespace and the tag name. Copy that namespace into
ET.register_namespace('',"XXXXX YOUR NAMESPACEXXXXXX")
before you start parsing your file; that should remove all the namespace prefixes when you write.
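Instead of copying the namespace by hand, it can be read from the root tag. A minimal sketch, assuming the root element carries the default namespace; the helper name `namespace_of` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the namespace URI in an element's tag, or '' if it has none."""
    tag = element.tag
    if tag.startswith("{"):
        return tag[1:].partition("}")[0]
    return ""


# Hypothetical document; in practice this would come from ET.parse(path).
root = ET.fromstring('<Project xmlns="urn:example:project"><Item /></Project>')

uri = namespace_of(root)        # the part between '{' and '}' in root.tag
ET.register_namespace("", uri)  # serialize that namespace without a prefix
output = ET.tostring(root, encoding="unicode")

print(uri)
print(output)
```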

Programmatically clean/ignore namespaces in XML - python

I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.
The XML looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
{...}
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
<cmdty:space>ISO4217</cmdty:space>
<cmdty:id>BRL</cmdty:id>
<cmdty:get_quotes/>
<cmdty:quote_source>currency</cmdty:quote_source>
<cmdty:quote_tz/>
</gnc:commodity>
Right now, I'm able to iterate and get results using
import xml.etree.ElementTree as ET
r = ET.parse("file.xml").findall('.//')
after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.
Note that I'm a complete noob in Python, and I've read: Python and GnuCash: Extract data from GnuCash files, Cleaning an XML file in Python before parsing and python: xml.etree.ElementTree, removing "namespaces" along with the ElementTree docs, and I'm still lost...
I've come up with this solution:
def strip_namespaces(self, tree):
    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)
    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)
        end = re.sub(nspOpen, '<\/', tree.tag)
    # pprint(finaltree)
    return
But I'm failing to apply it. I can't seem to retrieve the tag names as they appear in the file.

I think the Python code below will be helpful to you.
sample.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:change>
<gnc:lastUpdate>2018-12-21
</gnc:lastUpdate>
</gnc:change>
<gnc:bill>
<gnc:billAccountNumber>1234</gnc:billAccountNumber>
<gnc:roles>
<gnc:id>111111</gnc:id>
<gnc:pos>2</gnc:pos>
<gnc:genid>15</gnc:genid>
</gnc:roles>
</gnc:bill>
<gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>
Python code to remove the xmlns part from the root tag:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

def xmlns(tag):
    # Drop every "{namespace}" part from a tag such as "{uri}name".
    parts = []
    for piece in tag.split('{'):
        if '}' in piece:
            parts.append(piece.split('}')[1])
        else:
            parts.append(piece)
    return ''.join(parts)

tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)         # root tag with the namespace URI as prefix
print(xmlns(root.tag))  # root tag without the namespace URI
Output:
{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
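The snippet above only cleans a single tag. To read entries regardless of namespace across a whole tree, two options can be sketched: a namespace wildcard in the search path (supported by ElementTree since Python 3.8), or stripping the namespaces from every element once. The function name `strip_namespaces` is borrowed from the question, and the XML is a shortened version of sample.xml.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc">
  <gnc:bill>
    <gnc:billAccountNumber>1234</gnc:billAccountNumber>
  </gnc:bill>
</gnc:prodinfo>"""


def strip_namespaces(root):
    """Rewrite every tag and attribute name in place without its namespace."""
    for element in root.iter():
        if isinstance(element.tag, str):  # comments and PIs have non-string tags
            element.tag = element.tag.rpartition("}")[2]
        for key in list(element.attrib):
            if "}" in key:
                element.attrib[key.rpartition("}")[2]] = element.attrib.pop(key)
    return root


root = ET.fromstring(SAMPLE)

# Option 1 (Python 3.8+): leave the tree alone and use a namespace wildcard.
wildcard_hit = root.findtext(".//{*}billAccountNumber")

# Option 2: strip the namespaces once, then use plain paths.
strip_namespaces(root)
plain_hit = root.findtext("bill/billAccountNumber")

print(wildcard_hit, plain_hit, root.tag)
```

Stripping is lossy: two elements with the same local name in different namespaces become indistinguishable, so the wildcard form is the safer choice when the file will be written back.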

Set a DTD using minidom in python

I am trying to include a reference to a DTD in my XML doc using minidom.
I am creating the document like this (with Document imported from xml.dom.minidom):
doc = Document()
foo = doc.createElement('foo')
doc.appendChild(foo)
doc.toxml()
This gives me:
<?xml version="1.0" ?>
<foo/>
I need to get something like:
<?xml version="1.0" ?>
<!DOCTYPE something SYSTEM "http://www.path.to.my.dtd.com/my.dtd">
<foo/>

The documentation is out of date. Use the source, Luke. I do it something like this.
from xml.dom.minidom import DOMImplementation
imp = DOMImplementation()
doctype = imp.createDocumentType(
    qualifiedName='foo',
    publicId='',
    systemId='http://www.path.to.my.dtd.com/my.dtd',
)
doc = imp.createDocument(None, 'foo', doctype)
print(doc.toxml())
This prints the following.
<?xml version="1.0" ?><!DOCTYPE foo SYSTEM 'http://www.path.to.my.dtd.com/my.dtd'><foo/>
Note how the root element is created automatically by createDocument(). Also, your 'something' has been changed to 'foo': the DOCTYPE declaration must name the root element itself.
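Because createDocument() already creates the root element, further content is added through doc.documentElement rather than by appending a new root. A short sketch of that follow-up step; the child element "bar" and its text are made up for illustration.

```python
from xml.dom.minidom import DOMImplementation

imp = DOMImplementation()
doctype = imp.createDocumentType("foo", "", "http://www.path.to.my.dtd.com/my.dtd")
doc = imp.createDocument(None, "foo", doctype)

# createDocument() already made <foo/>; reach it through documentElement.
root = doc.documentElement
bar = doc.createElement("bar")  # "bar" is a made-up child element
bar.appendChild(doc.createTextNode("hello"))
root.appendChild(bar)

xml_text = doc.toxml()
print(xml_text)
print(doc.doctype.systemId)
```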

According to the Python docs, there is no implementation of the DocumentType interface in the minidom.
