Python ElementTree unable to parse xml file correctly - python

I am trying to Parse an XML file using elemenTree of Python.
The xml file is like below:
<App xmlns="test attribute">
<name>sagar</name>
</App>
Parser Code:
from xml.etree.ElementTree import ElementTree
from xml.etree.ElementTree import Element
import xml.etree.ElementTree as etree
def parser():
eleTree = etree.parse('app.xml')
eleRoot = eleTree.getroot()
print("Tag:"+str(eleRoot.tag)+"\nAttrib:"+str(eleRoot.attrib))
if __name__ == "__main__":
parser()
Output:
[sagar#linux Parser]$ python test.py
Tag:{test attribute}App <------------- It should print only "App"
Attrib:{}
When I remove "xmlns" attribute or rename "xmlns" attribute to something else the eleRoot.tag is printing correct value.
Why can't element tree unable to parse the tags properly when I have "xmlns" attribute in the tag. Am I missing some pre-requisite to parse an XML of this format using element tree?

Your xml uses the attribute xmlns, which is trying to define a default xml namespace. Xml namespaces are used to solve naming conflicts, and require a valid URI for their value, as such the value of "test attribute" is invalid, which appears to be troubling the parsing of your xml by etree.
For more information on xml namespaces see XML Namespaces at W3 Schools.
Edit:
After looking into the issue further it appears that the fully qualified name of an element when using a python's ElementTree has the form {namespace_url}tag_name. This means that, as you defined the default namespace of "test attribute", the fully qualified name of your "App" tag is infact {test attribute}App, which is what you're getting out of your program.
Source

Related

how to edit xml root attributes with python

Recently ive been messing around with editing xml files with a python script for a project im working on but i cant figure out how to edit the attributes of the root element.
for example the xml file would say:
<root width="200">
<element1>
</element1>
</root>
what i want to do is have my code find the width attribute and change it to some other value, i know how to edit elements after the root but not the root itself
code im using for changing attributes
You could use the following module xml.etree.ElementTree. With this module you can set up attributes using xml.etree.ElementTree.Element.set()
Here is an example of snippet you could use:
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml')
root = tree.getroot()
root.set('width','400')
print(root.attrib)
tree.write('output.xml')

Python xml.etree.ElementTree 'findall()' method does not work with several namespaces

I am trying to parse an XML file with several namespaces. I already have a function which produces namespace map – a dictionary with namespace prefixes and namespace identifiers (example in the code). However, when I pass this dictionary to the findall() method, it works only with the first namespace but does not return anything if an element on XML path is in another namespace.
(It works only in case of the first namespace which has None as its prefix.)
Here is a code sample:
import xml.etree.ElementTree as ET
file - '.\folder\example_file.xml' # path to the file
xml_path = './DataArea/Order/Item/Price' # XML path to the element node
tree = ET.parse(file)
root = tree.getroot()
nsmap = dict([node for _, node in ET.iterparse(exp_file, events=['start-ns'])])
# This produces a dictionary with namespace prefixes and identifiers, e.g.
# {'': 'http://firstnamespace.example.com/', 'foo': 'http://secondnamespace.example.com/', etc.}
for elem in root.findall(xml_path, nsmap):
# Do something
EDIT:
On the mzjn's suggestion, I'm including sample XML file:
<?xml version="1.0" encoding="utf-8"?>
<SampleOrder xmlns="http://firstnamespace.example.com/" xmlns:foo="http://secondnamespace.example.com/" xmlns:bar="http://thirdnamespace.example.com/" xmlns:sta="http://fourthnamespace.example.com/" languageCode="en-US" releaseID="1.0" systemEnvironmentCode="PROD" versionID="1.0">
<ApplicationArea>
<Sender>
<SenderCode>4457</SenderCode>
</Sender>
</ApplicationArea>
<DataArea>
<Order>
<foo:Item>
<foo:Price>
<foo:AmountPerUnit currencyID="USD">58000.000000</foo:AmountPerUnit>
<foo:TotalAmount currencyID="USD">58000.000000</foo:TotalAmount>
</foo:Price>
<foo:Description>
<foo:ItemCode>259601</foo:ItemCode>
<foo:ItemName>PORTAL GUN 6UBC BLUE</foo:ItemName>
</foo:Description>
</foo:Item>
<bar:Supplier>
<bar:SupplierID>4474</bar:SupplierID>
<bar:SupplierName>APERTURE SCIENCE, INC</bar:SupplierName>
</bar:Supplier>
<sta:DeliveryLocation>
<sta:RecipientID>103</sta:RecipientID>
<sta:RecipientName>WARHOUSE 664</sta:RecipientName>
</sta:DeliveryLocation>
</Order>
</DataArea>
</SampleOrder>
You should specify the namespaces in your xml_path, for example: ./foo:DataArea/Order/Item/bar:Price. The reason it works with the empty namespace is because it is the default, you don't have to specify that one in your path.
Based on Jan Jaap Meijerink's answer and mzjn's comments under the question, the solution is to insert namespace prefixed in the XML path. This can be done by inserting a wildcard {*} as mzjn's comment and this answer (https://stackoverflow.com/a/62117710/407651) suggest.
To document the solution, you can add this simple operation to your code:
xml_path = './DataArea/Order/Item/Price/TotalAmount'
xml_path_splitted_to_list = xml_path.split('/')
xml_path_with_wildcard_prefix = '/{*}'.join(xml_path_splitted_to_list)
In case there are two or more nodes with the same XML path but different namespaces, findall() method (quite naturally) accesses all of those element nodes.

Python signxml XML signature package. How to add xml placehoder for Signature tag?

I am new to Python. I have installed signxml package and I am doing xml signature process.
Link to python package : https://pypi.org/project/signxml/
My xml file is getting generated. However XML signature code is little different. I was able to match most of the part but I donot have idea how to match following one.
Can any one please help me for that.
Different part is following tag
<Signature>
Above part should be like this one
<Signature xmlns="http://www.w3.org/2000/09/xmldsig#">
When i searched into signxml core file i found following note.
To specify the location of an enveloped signature within **data**, insert a
``<ds:Signature Id="placeholder"></ds:Signature>`` element in **data** (where
"ds" is the "http://www.w3.org/2000/09/xmldsig#" namespace). This element will
be replaced by the generated signature, and excised when generating the digest.
How to make change to get this changed.
Following is my python code
from lxml import etree
import xml.etree.ElementTree as ET
from signxml import XMLSigner, XMLVerifier
import signxml
el = ET.parse('example.xml')
root = el.getroot()
#cert = open("key/public.pem").read()
key = open("key/private.pem").read()
signed_root = XMLSigner(method=signxml.methods.enveloped,signature_algorithm='rsa-sha512',digest_algorithm="sha512").sign(root, key=key)
tree = ET.ElementTree(signed_root)
#dv = tree.findall(".//DigestValue");
#print(dv);
tree.write("new_signed_file.xml")
What above code doing is. It takes one xml file and do digital signature process and generates new file.
Can anyone please guide me where and what change should i do for this requirements ?
I am assuming you are using python signxml
Go to python setup and open this file Python\Lib\site-packages\signxml\ __init__.py
Open __init__.py file and do following changes.
Find following code
def _unpack(self, data, reference_uris):
sig_root = Element(ds_tag("Signature"), nsmap=self.namespaces)
Change with following code.
def _unpack(self, data, reference_uris):
#sig_root = Element(ds_tag("Signature"), nsmap=self.namespaces)
sig_root = Element(ds_tag("Signature"), xmlns="http://www.w3.org/2000/09/xmldsig#")
After doing this change re-compile your python signxml package.
Re-generate new xml signature file.

Python 2.7.16 - ImportError: No module named etree.ElementTree

I am making a script to perform creating and writing data to XML file. The error is no module no module name
I refer to this stackoverflow link, Python 2.5.4 - ImportError: No module named etree.ElementTree. I refer to this tutorial, https://stackabuse.com/reading-and-writing-xml-files-in-python/. I still do not understand on what is the solution. I tried to replace
"from elementtree import ElementTree"
to
"from xml.etree import ElementTree"
It still did not work.
#!/usr/bin/python
import xml.etree.ElementTree as xml
root = xml.Element("FOLDER")
child = xml.Element("File")
root.append(child)
fn = xml.SubElement(child, "PICTURE")
fn.text = "he32dh32rf43hd23"
md5 = xml.SubElement(child, "CONTENT")
md5.text = "he32dh32rf43hd23"
tree = xml.ElementTree(root)
with open(xml.xml, "w") as fh:
tree.write(fh)
"""
I expect the result to be that data is written to xml file. But I received an error shown below,
File "./xml.py", line 2, in <module>
import xml.etree.ElementTree as xml
File "/root/Desktop/virustotal/testxml/xml.py", line 2, in <module>
import xml.etree.ElementTree as xml
```ImportError: No module named etree.ElementTree
import xml.etree.ElementTree as xml
and make sure you have the __init__.py file within the same folder if you use your own xml module and please avoid the path conflict.
then it will work.
etree package is provided by "ElementTree" and "lxml" both are similar but it is reported that ElementTree have bugs in python 2.7 and works great in python3.
I see you are using python 2.7 so lxml will work fine for you.
try this
from lxml import etree
from io import StringIO
tree = etree.parse(StringIO(xml_file))
# incase you need to read an XML.
print(tree.getroot())
And the StringIO is from default python io package.
StringIO is neccessary when you are passing file to it (I mean putting XML in a file and passing that file to parser).
It's good to keep it even tough you are passing XML as a big string.
all the writing operations will be same for both.

Python lxml: how to validate file using xml schema file defined in the xml file

I'm using Python lxml library to parse xml files. I need to validate a xml file against a xml schema. lxml supports XML schema validation, but the xml schema filepath/content must be provided (information available here: http://lxml.de/validation.html). However, I don't know the xml schema filepath beforehand, it should be parsed from the xml file header tags. I can't find a way to access these tags.
lxml supports this use case somehow?
If the schema is linked using attributes on the root element, in the http://www.w3.org/2001/XMLSchema-instance namespace, you can retrieve these with lxml by prefixing the attribute name with the namespace URL in curly braces:
XMLSchemaNamespace = '{http://www.w3.org/2001/XMLSchema-instance}'
document = lxml.parse(xmlfile)
schemaLink = document.get(XMLSchemaNamespace + 'schemaLocation')
if schemaLink is None:
schemaLink = document.get(XMLSchemaNamespace + 'noNamespaceSchemaLocation')
and then use a URL library to load the schema from the location referred. See lxml namespace handling for more information.

Categories

Resources