Programmatically clean/ignore namespaces in XML - python


I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.
The XML looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
{...}
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
<cmdty:space>ISO4217</cmdty:space>
<cmdty:id>BRL</cmdty:id>
<cmdty:get_quotes/>
<cmdty:quote_source>currency</cmdty:quote_source>
<cmdty:quote_tz/>
</gnc:commodity>
Right now, I'm able to iterate and get results using
import xml.etree.ElementTree as ET
r = ET.parse("file.xml").findall('.//')
after manually cleaning the namespaces, but I'm looking for a solution that can either read the entries regardless of their namespaces or remove the namespaces before parsing.
Note that I'm a complete noob in Python. I've read "Python and GnuCash: Extract data from GnuCash files", "Cleaning an XML file in Python before parsing" and "python: xml.etree.ElementTree, removing "namespaces"", along with the ElementTree docs, and I'm still lost...
I've come up with this solution:
def strip_namespaces(self, tree):
    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)
    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)
        end = re.sub(nspOpen, '<\/', tree.tag)
        # pprint(finaltree)
    return
But I'm failing to apply it. I can't seem to be able to retrieve the tag names as they appear on the file.

I think the Python code below will be helpful to you.
sample.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:change>
<gnc:lastUpdate>2018-12-21
</gnc:lastUpdate>
</gnc:change>
<gnc:bill>
<gnc:billAccountNumber>1234</gnc:billAccountNumber>
<gnc:roles>
<gnc:id>111111</gnc:id>
<gnc:pos>2</gnc:pos>
<gnc:genid>15</gnc:genid>
</gnc:roles>
</gnc:bill>
<gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>
Python code to remove the namespace from the root tag:
import xml.etree.ElementTree as ET
def xmlns(tag):
    # Drop every "{namespace-uri}" part from an ElementTree tag name.
    parts = []
    for piece in tag.split('{'):
        if '}' in piece:
            parts.append(piece.split('}')[1])
        else:
            parts.append(piece)
    return ''.join(parts)
tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)         # root tag with the namespace URI as prefix
print(xmlns(root.tag))  # root tag without the namespace URI
Output:
{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
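The helper above cleans a single tag name. To ignore namespaces across a whole document, which is what the question asks for, one option is to walk the parsed tree and rewrite every tag in place. The following is only a minimal sketch: strip_namespaces and the shortened SAMPLE document are illustrative names, and the {*} wildcard shown as an alternative assumes Python 3.8 or later. Note that ElementTree has already expanded each prefix into a {uri} part by the time the tree exists, which is why a regex looking for "<prefix:" in a tag never matches.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for sample.xml (assumption: same namespace URI as above).
SAMPLE = """<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc">
  <gnc:bill>
    <gnc:billAccountNumber>1234</gnc:billAccountNumber>
  </gnc:bill>
</gnc:prodinfo>"""


def strip_namespaces(root):
    """Rewrite tags and attribute names in place, dropping '{uri}' parts."""
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(elem.tag, str) and '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
        for name in list(elem.attrib):
            if '}' in name:
                elem.attrib[name.split('}', 1)[1]] = elem.attrib.pop(name)
    return root


# Option 1 (Python 3.8+): leave the tree alone and match any namespace.
untouched = ET.fromstring(SAMPLE)
wildcard_text = untouched.find('.//{*}billAccountNumber').text

# Option 2: strip the namespaces, then use plain tag names.
root = strip_namespaces(ET.fromstring(SAMPLE))
print(root.tag)
print(root.findtext('bill/billAccountNumber'))
```

Stripping is lossy: two elements with the same local name in different namespaces become indistinguishable, so the wildcard search may be the safer choice when the file is written back out.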

Related

Generate XML Document in Python 3 using Namespaces and ElementTree

I am having problems generating an XML document using the ElementTree framework in Python 3. I tried registering the namespace before setting up the document. Right now it seems that I can generate an XML document only by adding the namespace to each element, like a=Element("{full_namespace_URI}element_name"), which seems tedious.
How do I set up the default namespace so that I can omit it from each element?
Any help is appreciated.
I have written a small demo program for Python 3:
from io import BytesIO
from xml.etree import ElementTree as ET
ET.register_namespace("", "urn:dslforum-org:service-1-0")
"""
desired output
==============
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0">
<childNode>content</childNode>
</topNode>
"""
# build XML document without namespaces
a = ET.Element("topNode")
b = ET.Element("childNode")
b.text = "content"
a.append(b)
tree = ET.ElementTree(a)
# build XML document with namespaces
a_ns = ET.Element("{dsl}topNode")
b_ns = ET.Element("{dsl}childNode")
b_ns.text = "content"
a_ns.append(b_ns)
tree_ns = ET.ElementTree(a_ns)
def print_element_tree(element_tree, comment, default_namespace=None):
    """
    print element tree with comment to standard out
    """
    with BytesIO() as buf:
        element_tree.write(buf, encoding="utf-8", xml_declaration=True,
                           default_namespace=default_namespace)
        buf.seek(0)
        print(comment)
        print(buf.read().decode("utf-8"))
print_element_tree(tree, "Element Tree without XML namespace")
print_element_tree(tree_ns, "Element Tree with XML namespace", "dsl")

I believe you are overthinking this.
Registering a default namespace in your code avoids the ns0: aliases.
Registering any namespaces you will use while creating a document allows you to designate the alias used for each namespace.
To achieve your desired output, assign the namespace to your top element:
a = ET.Element("{urn:dslforum-org:service-1-0}topNode")
The preceding ET.register_namespace("", "urn:dslforum-org:service-1-0") call makes that URI the default namespace when the document is serialized, so topNode is written with an xmlns attribute and the tag names carry no prefix.
<?xml version='1.0' encoding='utf-8'?>
<topNode xmlns="urn:dslforum-org:service-1-0"><childNode>content</childNode></topNode>
If you remove the register_namespace() call, then you get this monstrosity:
<?xml version='1.0' encoding='utf-8'?>
<ns0:topNode xmlns:ns0="urn:dslforum-org:service-1-0"><childNode>content</childNode></ns0:topNode>
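Putting the pieces together, here is a small self-contained sketch of the approach described above. The names NS, top and child are illustrative only. Unlike the answer's one-line change, this version qualifies the child element as well, on the assumption that childNode is meant to live in the same namespace; the serialized text is the same either way, but the in-memory tree then matches what a parser would read back.

```python
import xml.etree.ElementTree as ET

NS = "urn:dslforum-org:service-1-0"

# Map the URI to the empty prefix so it is written as xmlns="..." with no ns0:.
ET.register_namespace("", NS)

# Build the tree with fully qualified names: "{uri}localname".
top = ET.Element(f"{{{NS}}}topNode")
child = ET.SubElement(top, f"{{{NS}}}childNode")
child.text = "content"

xml_text = ET.tostring(top, encoding="unicode")
print(xml_text)

# Parsing the result back gives the same qualified tag names.
reparsed = ET.fromstring(xml_text)
```

A small helper that prepends the URI to a local name can remove most of the tedium the question mentions, since ElementTree itself has no notion of a "current" namespace while building.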

Read XML using pandas in Python

I have an XML file. I am trying to read it in the usual way, as shown below:
def xmlfilereadread(self, path):
    doc = minidom.parse(path)
    Account = doc.getElementsByTagName("sf:ReceiverSet")[0]
    num = Account.getAttribute('totalNo')
    aList = []
    for i in range(int(num)):
        print(i)
        AccountReference = doc.getElementsByTagName("sf:Receiver")[i]
But I need to use pandas instead of this code. How can I read the data? My sample XML is:
<?xml version="1.0" encoding="UTF-8"?>
<sf:IFile xmlns:sf="http://www.canadapost.ca/smartflow" sequenceNo="10">
  <sf:ReceiverSet documentTypes="TAXBILL" organization="lincolntax" totalNo="3">
    <sf:Receiver sequenceNo="1" correlationID="1114567890123456789">
      <sf:AccountReference>11145678901234567891111</sf:AccountReference>
      <sf:SubscriptionAuth>
        <sf:ParamSet>
          <sf:Param name="auth1">1114567890123456789</sf:Param>
          <sf:Param name="auth2">CARTER, JOE</sf:Param>
        </sf:ParamSet>
      </sf:SubscriptionAuth>
    </sf:Receiver>
    <sf:Receiver sequenceNo="2" correlationID="2224567890123456789">
      <sf:AccountReference>22245678901234567892222</sf:AccountReference>
      <sf:SubscriptionAuth>
        <sf:ParamSet>
          <sf:Param name="auth1">2224567890123456789</sf:Param>
          <sf:Param name="auth2">DOE, JANE</sf:Param>
        </sf:ParamSet>
      </sf:SubscriptionAuth>
    </sf:Receiver>
    <sf:Receiver sequenceNo="3" correlationID="3334567890123456789">
      <sf:AccountReference>33345678901234567893333</sf:AccountReference>
      <sf:SubscriptionAuth>
        <sf:ParamSet>
          <sf:Param name="auth1">3334567890123456789</sf:Param>
          <sf:Param name="auth2">SOZE, KEYSER</sf:Param>
        </sf:ParamSet>
      </sf:SubscriptionAuth>
    </sf:Receiver>
  </sf:ReceiverSet>
</sf:IFile>

XML is an inherently hierarchical data format, and the most natural
way to represent it is with a tree. ET has two classes for this
purpose - ElementTree represents the whole XML document as a tree, and
Element represents a single node in this tree. Interactions with the
whole document (reading and writing to/from files) are usually done on
the ElementTree level. Interactions with a single XML element and its
sub-elements are done on the Element level.
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Or you can use lxml:
from lxml import etree
tree = etree.parse(r'local-path-to-.xml')
print(etree.tostring(tree))
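Neither snippet above produces a table yet. As a rough sketch of the missing step, the receivers can be flattened into one dictionary per sf:Receiver using only the standard library. The helper name receiver_rows and the column layout are assumptions made for illustration, and SAMPLE is a two-receiver excerpt of the XML from the question.

```python
import xml.etree.ElementTree as ET

# Two-receiver excerpt of the question's file (assumption: same structure).
SAMPLE = """<sf:IFile xmlns:sf="http://www.canadapost.ca/smartflow" sequenceNo="10">
  <sf:ReceiverSet documentTypes="TAXBILL" organization="lincolntax" totalNo="2">
    <sf:Receiver sequenceNo="1" correlationID="1114567890123456789">
      <sf:AccountReference>11145678901234567891111</sf:AccountReference>
      <sf:SubscriptionAuth><sf:ParamSet>
        <sf:Param name="auth1">1114567890123456789</sf:Param>
        <sf:Param name="auth2">CARTER, JOE</sf:Param>
      </sf:ParamSet></sf:SubscriptionAuth>
    </sf:Receiver>
    <sf:Receiver sequenceNo="2" correlationID="2224567890123456789">
      <sf:AccountReference>22245678901234567892222</sf:AccountReference>
      <sf:SubscriptionAuth><sf:ParamSet>
        <sf:Param name="auth1">2224567890123456789</sf:Param>
        <sf:Param name="auth2">DOE, JANE</sf:Param>
      </sf:ParamSet></sf:SubscriptionAuth>
    </sf:Receiver>
  </sf:ReceiverSet>
</sf:IFile>"""

# Prefix-to-URI mapping used by find/iterfind; the prefix itself is arbitrary.
NS = {"sf": "http://www.canadapost.ca/smartflow"}


def receiver_rows(root):
    """Return one flat dict per sf:Receiver element."""
    rows = []
    for receiver in root.iterfind(".//sf:Receiver", NS):
        row = dict(receiver.attrib)  # sequenceNo, correlationID
        row["AccountReference"] = receiver.findtext("sf:AccountReference", namespaces=NS)
        # Each sf:Param becomes a column named after its "name" attribute.
        for param in receiver.iterfind(".//sf:Param", NS):
            row[param.get("name")] = param.text
        rows.append(row)
    return rows


rows = receiver_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

A list of dictionaries like this is a shape that pandas.DataFrame(rows) accepts, so the last step to a DataFrame is a single call; pandas is left out of the sketch only to keep it dependency-free.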

How do I update the value of a particular XML tag using Python?

My xml file looks like below :-
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Messages xmlns="URL/sampleMessages-v1">
<Header>
<TransactionId>0</TransactionId>
<RequestNo>41194812</RequestNo>
<VNo>6789</VNo>
<Source></Source>
</Header>
...
...
</Messages>
I want to read it and change the RequestNo value
<RequestNo>41194812</RequestNo> to
<RequestNo>41194000</RequestNo>
I am currently using the ElementTree module on a Windows machine.
I want to update the value in the same file.
I have tried the code below:
for elem in root:
    for subelem in elem:
        #print (subelem.tag)
        if 'RequestNo' in subelem.tag :
            #print (subelem.text)
            subelem.text="41194813"
But I am not able to see the change, and I don't know how to write the new value subelem.text="41194813" back to the existing XML file.

Your for loop does the job: it did replace the text correctly. The change is in your root variable. You can verify that by adding the following line right after the for loop:
ElementTree.dump(root)
Now that you have the XML updated, you will need to write that into a file:
tree.write('newfile.xml')
Where tree is the result of ElementTree.parse(). So, to put everything together:
from xml.etree import ElementTree
tree = ElementTree.parse('messages.xml')
root = tree.getroot()
for elem in root:
    for subelem in elem:
        if 'RequestNo' in subelem.tag:
            subelem.text = '41194813'
            break
tree.write('messages-new.xml')
Dealing with Namespaces
Your XML document contains namespaces, so if you plan to search for a tag, you need to include the namespaces in the tag names. Here is an alternative solution which deals with namespaces:
tree = ElementTree.parse('messages.xml')
root = tree.getroot()
namespaces = {'xxx': 'URL/sampleMessages-v1'}
node = root.find('xxx:Header/xxx:RequestNo', namespaces)
if node is not None:
    node.text = '41194813'
tree.write('messages-new.xml')
In the above example, I just gave your namespace the name 'xxx'. It can be anything ('foo', 'bar', ...), but the same name must be used as the prefix in the call to root.find().
Removing "ns0" from Output File
In order to remove "ns0" from output file, you need to register the namespace before writing:
ElementTree.register_namespace('', 'URL/sampleMessages-v1')
tree.write('messages-new.xml')
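The question asks for the value to be updated in the same file, so here is a sketch that combines the three steps above and writes back to the path it read from. The function name update_request_no, the prefix "m" and the temporary file are illustrative assumptions; SAMPLE is a shortened copy of the document from the question.

```python
import os
import tempfile
from xml.etree import ElementTree

NS_URI = "URL/sampleMessages-v1"

# Shortened stand-in for the question's file.
SAMPLE = f"""<Messages xmlns="{NS_URI}">
  <Header>
    <TransactionId>0</TransactionId>
    <RequestNo>41194812</RequestNo>
  </Header>
</Messages>"""


def update_request_no(path, new_value):
    """Set Header/RequestNo to new_value and overwrite the file at path."""
    # Keep the default namespace unprefixed in the output (no ns0:).
    ElementTree.register_namespace("", NS_URI)
    tree = ElementTree.parse(path)
    node = tree.getroot().find("m:Header/m:RequestNo", {"m": NS_URI})
    if node is None:
        return False
    node.text = new_value
    # Writing to the same path replaces the original file.
    tree.write(path, encoding="UTF-8", xml_declaration=True)
    return True


# Demonstrate on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "messages.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    updated = update_request_no(path, "41194000")
    with open(path, encoding="utf-8") as f:
        result = f.read()

print(result)
```

Overwriting in place leaves no copy of the old contents, so keeping a backup or writing to a new file first, as the answer does, may be preferable for real data.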

Python xml ElementTree findall returns empty result

I would like to parse the following XML file using the Python xml ElementTree API.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<foos>
<foo_table>
<!-- bar -->
<fooelem>
<fname>BBBB</fname>
<group>SOMEGROUP</group>
<module>some module</module>
</fooelem>
<fooelem>
<fname>AAAA</fname>
<group>other group</group>
<module>other module</module>
</fooelem>
<!-- bar -->
</foo_table>
</foos>
In this example code I try to find all the elements under /foos/foo_table/fooelem/fname but obviously findall doesn't find anything when running this code.
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file="min.xml")
for i in tree.findall("./foos/foo_table/fooelem/fname"):
    print(i)
root = tree.getroot()
for i in root.findall("./foos/foo_table/fooelem/fname"):
    print(i)
I am not experienced with the ElementTree API, but I've used the example under https://docs.python.org/2/library/xml.etree.elementtree.html#example. Why is it not working in my case?

foos is your root element, so the path passed to findall must start below it, e.g.
root = tree.getroot()
for i in root.findall("foo_table/fooelem/fname"):
    print(i.text)
Output:
BBBB
AAAA

This is because the path is evaluated relative to the root element (foos). A path that begins with foos therefore looks for a foos child inside foos, which does not exist.
Use this instead: foo_table/fooelem/fname
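The difference between the two paths can be checked with a short sketch. SAMPLE is a trimmed copy of the file from the question, and the variable names are illustrative; the ".//fname" form is an additional option that matches at any depth.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of min.xml from the question.
SAMPLE = """<foos>
  <foo_table>
    <fooelem><fname>BBBB</fname></fooelem>
    <fooelem><fname>AAAA</fname></fooelem>
  </foo_table>
</foos>"""

root = ET.fromstring(SAMPLE)  # root is the <foos> element itself

# Looks for <foos> *inside* <foos>, so nothing matches.
wrong = root.findall("./foos/foo_table/fooelem/fname")

# Path relative to the root element.
relative = [e.text for e in root.findall("foo_table/fooelem/fname")]

# Descendant search: matches <fname> at any depth below the root.
anywhere = [e.text for e in root.findall(".//fname")]

print(wrong, relative, anywhere)
```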

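If findall still returns nothing for your path, iter() is an alternative that searches all descendants regardless of depth:
e = xml.etree.ElementTree.parse(myfile3).getroot()
mylist=list(e.iter('checksum'))
print (len(mylist))
mylist will then contain every checksum element in the document.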

How to detect the root xml element of <?xml version="1.0" encoding="UTF-8"?> using Python and ElementTree

I am parsing an XML file that I expect the root element to be <data>. However, some users have modified these files and added the element <?xml version="1.0" encoding="UTF-8"?> at the top. I want to check to see if that exists and then fail my test to notify the user of this issue. I've tried to do the following but it keeps detecting the proper root element of <data>. Here is what I have so far.
<?xml version="1.0" encoding="UTF-8"?>
<data>
</data>
elementTree = self.param2
root = elementTree.find('.')
print(root.tag)
What I get to print out is:
data
(which is not what I expected).
Any ideas would be appreciated!

If you are using a proper XML API such as xml.dom or ElementTree, you should not have any problem dealing with the XML declaration: the parser consumes it and it never appears as an element. However, if you still insist on removing the declaration, try this:
from xml.dom import minidom
def remove_xml_declaration(xml_text):
    # Parse the text, then serialize only the root element,
    # which leaves the declaration out.
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    return root.toxml()
#
# Test
#
xml_text = '''<?xml version="1.0" encoding="UTF-8"?>
<data>
</data>
'''
# Remove declaration
xml_text = remove_xml_declaration(xml_text)
print(xml_text)
print('---')
# Remove declaration, even if it is not there
xml_text = remove_xml_declaration(xml_text)
print(xml_text)
print('---')

Well, I appreciate all the responses. However, I didn't want to remove it, I only wanted to detect it and have the user/developer remove it. Here is what I did to detect it.
import re
# The beginning of an XML Declaration to match.
xmlRegex = '(<\\?xml)'
rg = re.compile(xmlRegex, re.IGNORECASE | re.DOTALL)
lineCount = 0
# Raw string, so the backslash in the Windows path is not read as an escape.
with open(r"c:\file.xml") as f:
    for line in f:
        lineCount += 1
        match = rg.search(line)
        if match:
            self.assertTrue(False, logger.failed("An XML Declaration was detected on line: " + str(lineCount) + "."))
        else:
            pass
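Because an XML declaration is only valid at the very start of a document, a narrower check is possible. The sketch below is one way to do it, not the approach used above: has_xml_declaration is a hypothetical helper that works on text already read from the file, tolerates a leading byte order mark, and requires whitespace after "<?xml" so that processing instructions such as <?xml-stylesheet ...?> are not reported.

```python
import re

# "<?xml" followed by whitespace; "<?xml-stylesheet" does not match.
_DECLARATION = re.compile(r"<\?xml\s")


def has_xml_declaration(xml_text):
    """Return True if xml_text begins with an XML declaration."""
    # A byte order mark may precede the declaration in decoded text.
    return _DECLARATION.match(xml_text.lstrip("\ufeff")) is not None


with_decl = '<?xml version="1.0" encoding="UTF-8"?>\n<data>\n</data>\n'
without_decl = '<data>\n</data>\n'
stylesheet_only = '<?xml-stylesheet href="style.xsl"?>\n<data/>\n'

print(has_xml_declaration(with_decl))
print(has_xml_declaration(without_decl))
print(has_xml_declaration(stylesheet_only))
```

In a unittest test case, the result could be fed to self.assertFalse(...) with a message telling the user to remove the declaration.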
