Simple dom traversing in Python using xml.etree.ElementTree

For example, consider parsing a pom.xml file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<parent>
<groupId>com.parent</groupId>
<artifactId>parent</artifactId>
<version>1.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>
<modelVersion>2.0.0</modelVersion>
<groupId>com.parent.somemodule</groupId>
<artifactId>some_module</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>Some Module</name>
...
Code:
import xml.etree.ElementTree as ET
tree = ET.parse(pom)
root = tree.getroot()
groupId = root.find("groupId")
artifactId = root.find("artifactId")
Both groupId and artifactId are None. Why, when they are direct children of the root? I tried replacing root with tree (groupId = tree.find("groupId")), but that didn't change anything.

The problem is that you don't have a child named groupId, you have a child named {http://maven.apache.org/POM/4.0.0}groupId, because etree doesn't ignore XML namespaces, it uses "universal names". See Working with Namespaces and Qualified Names in the effbot docs.
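To make this concrete, here is a minimal sketch that looks the elements up by their universal names and, alternatively, through a prefix map. The inlined POM string and the "m" prefix are assumptions made for illustration; only the namespace URI comes from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed-down POM, inlined so the example is self-contained (assumption).
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>com.parent.somemodule</groupId>
  <artifactId>some_module</artifactId>
</project>"""

root = ET.fromstring(POM)

# A bare tag name does not match, because the real tag is namespace-qualified.
print(root.find("groupId"))
print(root.tag)

# Option 1: spell out the universal name, {namespace-uri}localname.
MAVEN_NS = "http://maven.apache.org/POM/4.0.0"
group_id = root.find("{%s}groupId" % MAVEN_NS)

# Option 2: pass a prefix-to-URI map; "m" is an arbitrary prefix chosen here.
ns = {"m": MAVEN_NS}
artifact_id = root.find("m:artifactId", ns)

print(group_id.text, artifact_id.text)
```

The prefix used in the map does not have to match anything in the file; only the URI has to be identical.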

Just to expand on abarnert's comment about BeautifulSoup: if you just want a quick and dirty solution to the problem, this is probably the fastest way to go about it. I have implemented this (for a personal script) that uses bs4, where you can traverse the tree with
element = dom.getElementsByTagNameNS('*','elementname')
This matches the element name in ANY namespace, which is handy if you know the file only uses one, so there is no ambiguity. (Note that getElementsByTagNameNS is a DOM method, as provided for instance by the standard library's xml.dom.minidom, rather than part of the bs4 API.)
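For reference, getElementsByTagNameNS is available on DOM objects from the standard library's xml.dom.minidom. The following is only a sketch of the wildcard lookup under that assumption; the inlined POM fragment is made up for the example.

```python
from xml.dom import minidom

# Trimmed-down POM, inlined so the example is self-contained (assumption).
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>com.parent.somemodule</groupId>
  <artifactId>some_module</artifactId>
</project>"""

dom = minidom.parseString(POM)

# "*" as the namespace URI matches elements in any namespace.
matches = dom.getElementsByTagNameNS("*", "groupId")

# The element's text lives in its first child, a Text node.
group_id_text = matches[0].firstChild.data
print(group_id_text)
```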

Related

how to edit xml root attributes with python

Recently I've been editing XML files with a Python script for a project I'm working on, but I can't figure out how to edit the attributes of the root element.
For example, the XML file would say:
<root width="200">
<element1>
</element1>
</root>
What I want is for my code to find the width attribute and change it to some other value. I know how to edit elements below the root, but not the root itself.
Code I'm using for changing attributes:

You could use the xml.etree.ElementTree module. It lets you set attributes with xml.etree.ElementTree.Element.set().
Here is a snippet you could use:
import xml.etree.ElementTree as ET
tree = ET.parse('input.xml')
root = tree.getroot()
root.set('width','400')
print(root.attrib)
tree.write('output.xml')

Adding a subElement into a subElement in xml with python

I am working with XML files in Python, and I want to ask whether there is a way to add a subelement to another subelement in the XML file.
For example, if the structure of the XML file is as follows and I want to add a new subelement under the container 'b', how can I do it?
<?xml version .....>
<module name=....>
<augment ....>
<container name="a">
</container>
<container name="b">
</container>
</augment>
</module>

If you want to do it in a more future-proof way, you may want to use an XML parser such as lxml.etree. You can parse your XML, work on its elements, and then dump it back to a file. Here is a simple working example:
from lxml import etree
xml = '''<module name="x">
<augment name="y">
<container name="a">
</container>
<container name="b">
</container>
</augment>
</module>'''.strip()
xtree = etree.fromstring(xml)
for element in xtree.xpath('.//container[@name="b"]'):
    new = etree.Element('something') # create new element to be inserted
    new.set('name', 'xyz') # define some attributes for new element
    element.append(new) # append it to your currently-processed element
print(etree.tostring(xtree,pretty_print=True).decode('ascii'))
For more see lxml documentation (https://lxml.de/tutorial.html)
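If installing lxml is not an option, roughly the same steps can be sketched with the standard library's xml.etree.ElementTree instead. This is a swapped-in alternative rather than the code from the answer above; the element name "something" and attribute value "xyz" are reused from it.

```python
import xml.etree.ElementTree as ET

XML = """<module name="x">
  <augment name="y">
    <container name="a"/>
    <container name="b"/>
  </augment>
</module>"""

root = ET.fromstring(XML)

# ElementTree's limited XPath supports attribute predicates like [@name='b'].
for container in root.findall(".//container[@name='b']"):
    # SubElement creates the child and attaches it to the parent in one call.
    ET.SubElement(container, "something", {"name": "xyz"})

result = ET.tostring(root, encoding="unicode")
print(result)
```

Unlike lxml, the standard library has no pretty_print option on tostring, so the new element is written without extra indentation.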

Python xml.etree.ElementTree 'findall()' method does not work with several namespaces

I am trying to parse an XML file with several namespaces. I already have a function which produces a namespace map: a dictionary of namespace prefixes and namespace identifiers (example in the code). However, when I pass this dictionary to the findall() method, it works only with the first namespace and returns nothing if an element on the XML path is in another namespace.
(It works only in case of the first namespace which has None as its prefix.)
Here is a code sample:
import xml.etree.ElementTree as ET
file = r'.\folder\example_file.xml' # path to the file (raw string, so \f is not read as an escape)
xml_path = './DataArea/Order/Item/Price' # XML path to the element node
tree = ET.parse(file)
root = tree.getroot()
nsmap = dict([node for _, node in ET.iterparse(file, events=['start-ns'])])
# This produces a dictionary with namespace prefixes and identifiers, e.g.
# {'': 'http://firstnamespace.example.com/', 'foo': 'http://secondnamespace.example.com/', etc.}
for elem in root.findall(xml_path, nsmap):
    pass # Do something
EDIT:
At mzjn's suggestion, I'm including a sample XML file:
<?xml version="1.0" encoding="utf-8"?>
<SampleOrder xmlns="http://firstnamespace.example.com/" xmlns:foo="http://secondnamespace.example.com/" xmlns:bar="http://thirdnamespace.example.com/" xmlns:sta="http://fourthnamespace.example.com/" languageCode="en-US" releaseID="1.0" systemEnvironmentCode="PROD" versionID="1.0">
<ApplicationArea>
<Sender>
<SenderCode>4457</SenderCode>
</Sender>
</ApplicationArea>
<DataArea>
<Order>
<foo:Item>
<foo:Price>
<foo:AmountPerUnit currencyID="USD">58000.000000</foo:AmountPerUnit>
<foo:TotalAmount currencyID="USD">58000.000000</foo:TotalAmount>
</foo:Price>
<foo:Description>
<foo:ItemCode>259601</foo:ItemCode>
<foo:ItemName>PORTAL GUN 6UBC BLUE</foo:ItemName>
</foo:Description>
</foo:Item>
<bar:Supplier>
<bar:SupplierID>4474</bar:SupplierID>
<bar:SupplierName>APERTURE SCIENCE, INC</bar:SupplierName>
</bar:Supplier>
<sta:DeliveryLocation>
<sta:RecipientID>103</sta:RecipientID>
<sta:RecipientName>WARHOUSE 664</sta:RecipientName>
</sta:DeliveryLocation>
</Order>
</DataArea>
</SampleOrder>

You should specify the namespace prefixes in your xml_path. For the sample file, where Item and Price belong to the foo namespace, that means ./DataArea/Order/foo:Item/foo:Price. It works for the empty prefix because that is the default namespace, so you don't have to specify it in your path.
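A minimal sketch of prefixed path steps against a trimmed copy of the sample file. It assumes Python 3.8 or later, where find() and findall() apply the '' entry of the namespace map to unprefixed steps; the inlined XML string is shortened for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample order from the question.
XML = """<SampleOrder xmlns="http://firstnamespace.example.com/"
             xmlns:foo="http://secondnamespace.example.com/">
  <DataArea>
    <Order>
      <foo:Item>
        <foo:Price>
          <foo:TotalAmount currencyID="USD">58000.000000</foo:TotalAmount>
        </foo:Price>
      </foo:Item>
    </Order>
  </DataArea>
</SampleOrder>"""

root = ET.fromstring(XML)

# Same shape as the nsmap in the question; the "" key maps unprefixed
# path steps to the default namespace (Python 3.8+).
nsmap = {
    "": "http://firstnamespace.example.com/",
    "foo": "http://secondnamespace.example.com/",
}

# Steps in the second namespace carry the foo: prefix explicitly.
prices = root.findall("./DataArea/Order/foo:Item/foo:Price", nsmap)
totals = [p.find("foo:TotalAmount", nsmap).text for p in prices]
print(totals)
```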

Based on Jan Jaap Meijerink's answer and mzjn's comments under the question, the solution is to insert namespace prefixes into the XML path. This can be done by inserting the wildcard {*}, as mzjn's comment and this answer (https://stackoverflow.com/a/62117710/407651) suggest.
To document the solution, you can add this simple operation to your code:
xml_path = './DataArea/Order/Item/Price/TotalAmount'
xml_path_splitted_to_list = xml_path.split('/')
xml_path_with_wildcard_prefix = '/{*}'.join(xml_path_splitted_to_list)
If two or more nodes share the same XML path but sit in different namespaces, findall() (quite naturally) returns all of those element nodes.
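The wildcard approach can be sketched end to end as follows. It assumes Python 3.8 or later, which is when ElementTree gained support for the {*} namespace wildcard; the inlined XML is a shortened copy of the sample file.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample order from the question.
XML = """<SampleOrder xmlns="http://firstnamespace.example.com/"
             xmlns:foo="http://secondnamespace.example.com/">
  <DataArea>
    <Order>
      <foo:Item>
        <foo:Price>
          <foo:TotalAmount currencyID="USD">58000.000000</foo:TotalAmount>
        </foo:Price>
      </foo:Item>
    </Order>
  </DataArea>
</SampleOrder>"""

root = ET.fromstring(XML)

xml_path = "./DataArea/Order/Item/Price/TotalAmount"
# Prefix every step after the leading "." with the {*} namespace wildcard.
wildcard_path = "/{*}".join(xml_path.split("/"))
print(wildcard_path)

# No namespace map is needed, since {*} matches any namespace.
amounts = [elem.text for elem in root.findall(wildcard_path)]
print(amounts)
```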

How to preserve namespaces when parsing xml via ElementTree in Python

Assume I have the following XML, which I want to modify using Python's ElementTree:
<root xmlns:prefix="URI">
<child prefix:name="***"/>
...
</root>
I'm doing some modification on the XML file like this:
import xml.etree.ElementTree as ET
tree = ET.parse('filename.xml')
# XML modification here
# save the modifications
tree.write('filename.xml')
Then the XML file looks like:
<root xmlns:ns0="URI">
<child ns0:name="***"/>
...
</root>
As you can see, the namespace prefix changed to ns0. I'm aware of ET.register_namespace().
The problems with ET.register_namespace() are:
1. You need to know the prefix and the URI.
2. It cannot be used with a default namespace.
E.g. if the XML looks like:
<root xmlns="http://uri">
<child name="name">
...
</child>
</root>
It will be transformed to something like:
<ns0:root xmlns:ns0="http://uri">
<ns0:child name="name">
...
</ns0:child>
</ns0:root>
As you can see, the default namespace is changed to ns0.
Is there any way to solve this problem with ElementTree?

ElementTree replaces the prefix of any namespace that has not been registered with ET.register_namespace. To preserve a namespace prefix, you need to register it before writing your modifications to a file. The following function does the job and registers all namespaces globally:
def register_all_namespaces(filename):
    namespaces = dict([node for _, node in ET.iterparse(filename, events=['start-ns'])])
    for ns in namespaces:
        ET.register_namespace(ns, namespaces[ns])
Call this function before ET.parse so that the namespaces remain unchanged:
import xml.etree.ElementTree as ET
register_all_namespaces('filename.xml')
tree = ET.parse('filename.xml')
# XML modification here
# save the modifications
tree.write('filename.xml')
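To check the effect without touching files on disk, here is a small sketch that runs the same idea on an in-memory document. The sample XML, the "http://other" URI, and the use of io.BytesIO are assumptions for illustration. Keep in mind that ET.register_namespace changes process-wide state.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document with a default namespace and one prefixed namespace.
SAMPLE = """<root xmlns="http://uri" xmlns:prefix="http://other">
  <child prefix:name="value"/>
</root>"""

def register_all_namespaces(source):
    """Register every prefix/URI pair declared in the XML source."""
    for _, (prefix, uri) in ET.iterparse(source, events=["start-ns"]):
        # The default namespace arrives with an empty-string prefix.
        ET.register_namespace(prefix, uri)

register_all_namespaces(io.BytesIO(SAMPLE.encode("utf-8")))

root = ET.fromstring(SAMPLE)
output = ET.tostring(root, encoding="unicode")
print(output)
```

With the namespaces registered, the serialized output keeps the default namespace declaration and the original prefix instead of generated ns0-style prefixes.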

Append element with SAX in python

I know how to parse XML with SAX in Python, but how would I go about inserting elements into the document I'm parsing? Do I have to create a separate file?
Could someone provide a simple example or alter the one I've put below? Thanks.
from xml.sax.handler import ContentHandler
from xml.sax import make_parser
import sys
class aHandler(ContentHandler):
    def startElement(self, name, attrs):
        print("<", name, ">")
    def characters(self, content):
        print(content)
    def endElement(self, name):
        print("</", name, ">")
handler = aHandler()
saxparser = make_parser()
saxparser.setContentHandler(handler)
datasource = open("settings.xml", "r")
saxparser.parse(datasource)
<?xml version="1.0"?>
<names>
<name>
<first>First1</first>
<second>Second1</second>
</name>
<name>
<first>First2</first>
<second>Second2</second>
</name>
<name>
<first>First3</first>
<second>Second3</second>
</name>
</names>

With DOM, you have the entire XML structure in memory.
With SAX, you don't have a DOM available, so you don't have anything to append an element to.
The main reason for using SAX is an XML structure that is really, really huge, where loading the whole DOM into memory would be a serious performance hit. If that isn't the case (as your small sample XML file suggests), I would always choose DOM over SAX.
If you go the DOM route (which seems to be the only option to me), look into lxml. It's one of the best Python XML libraries around.
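As a rough sketch of the in-memory route, the following appends a new name entry to a tree. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; the shortened sample data and the new values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the settings file from the question.
XML = """<names>
  <name>
    <first>First1</first>
    <second>Second1</second>
  </name>
</names>"""

root = ET.fromstring(XML)

# Build the new <name> element and attach it to the in-memory tree.
new_name = ET.SubElement(root, "name")
ET.SubElement(new_name, "first").text = "First2"
ET.SubElement(new_name, "second").text = "Second2"

firsts = [e.text for e in root.iter("first")]
print(firsts)

# To persist the change, wrap the root and write it out, e.g.
# ET.ElementTree(root).write("settings.xml")
```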
