Parsing XML Attributes with Python - python

I am trying to parse out all the green-highlighted attributes (some sensitive values have been blacked out). I have a bunch of XML files, all with similar formats. I already know how to loop through them individually, but I am having trouble parsing out the specific attributes.
XML Document
I need the text in these attributes:
name="text1" from
<project logLevel="verbose" version="2.0" mainModule="Main" name="text1">
destinationDir="/text2" from
<put label="Put Files" destinationDir="/Trigger/FPDMMT_INBOUND">
destDir="/text3" from
<copy disabled="false" version="1.0" label="Archive Files" destDir="/text3" suffix="">
I am using
import csv
import os
import re
import xml.etree.ElementTree as ET
tree = ET.parse(XMLfile_path)
item = tree.getroot()[0]
root = tree.getroot()
print (item.get("name"))
print (root.get("name"))
This outputs:
Main
text1
The item.get call reads from the element at index [0], which is the first child of the root in the tree, <module
The root.get call reads from the root element itself, <project
I know there's a way to search for exactly the right part of the root/tree with something like:
test = root.find('./project/module/ftp/put')
print (test.get("destinationDir"))
I need to be able to jump directly to the thing I need and output the attributes I need.
Any help would be appreciated
Thanks.

Simplified copy of your XML:
xml = '''<project logLevel="verbose" version="2.0" mainModule="Main" name="hidden">
<module name="Main">
<createWorkspace version="1.0"/>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination1">
</put>
</ftp>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination2">
</put>
</ftp>
<copy disabled="false" destDir="destination3">
</copy>
</module>
</project>
'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
name = root.get('name')
ftp_destination_dir1 = root.findall('./module/ftp/put')[0].get('destinationDir')
ftp_destination_dir2 = root.findall('./module/ftp/put')[1].get('destinationDir')
copy_destination_dir = root.find('./module/copy').get('destDir')
print(name)
print(ftp_destination_dir1)
print(ftp_destination_dir2)
print(copy_destination_dir)
# solution using lxml
from lxml import etree as et
root = et.fromstring(xml)
name = root.get('name')
ftp_destination_dirs = root.xpath('./module/ftp/put/@destinationDir')
copy_destination_dir = root.xpath('./module/copy/@destDir')[0]
print(name)
print(ftp_destination_dirs[0])
print(ftp_destination_dirs[1])
print(copy_destination_dir)
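Since the question mentions many files with a similar layout, the ETree lookups above can be wrapped in one function and called per file. This is only a sketch: the function name extract_attributes and the dictionary keys are made up for illustration, and it assumes every file follows the project/module/ftp/put layout of the simplified XML.

```python
import xml.etree.ElementTree as ET

def extract_attributes(xml_text):
    """Collect the project name and all destination directories (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        'name': root.get('name'),
        # findall returns a list, so zero, one, or many <put> elements all work
        'put_dirs': [put.get('destinationDir') for put in root.findall('./module/ftp/put')],
        'copy_dirs': [copy.get('destDir') for copy in root.findall('./module/copy')],
    }

sample = '''<project logLevel="verbose" version="2.0" mainModule="Main" name="hidden">
<module name="Main">
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination1"></put>
</ftp>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination2"></put>
</ftp>
<copy disabled="false" destDir="destination3"></copy>
</module>
</project>'''

result = extract_attributes(sample)
print(result)
```

For files on disk, ET.parse(path).getroot() can replace ET.fromstring inside the function. Using findall everywhere avoids an IndexError or AttributeError when a file has no put or copy element.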

Related

How can I retrieve specific information from a XML file using python?

I am working with Sentinel-2 Images, and I want to retrieve the Cloud_Coverage_Assessment from the XML file. I need to do this with Python.
Does anyone have any idea how to do this? I think I have to use the xml.etree.ElementTree but I'm not sure how?
The XML file:
<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
...
</n1:General_Info>
<n1:Geometric_Info>
...
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
...
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
...
</Technical_Quality_Assessment>
<Quality_Control_Checks>
...
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>
read xml from file
import xml.etree.ElementTree as ET
tree = ET.parse('sentinel2.xml')
root = tree.getroot()
print(root.find('.//Cloud_Coverage_Assessment').text)
..and I want to retrieve the Cloud_Coverage_Assessment
Try the below (use xpath)
import xml.etree.ElementTree as ET
xml = '''<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
</n1:General_Info>
<n1:Geometric_Info>
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
</Technical_Quality_Assessment>
<Quality_Control_Checks>
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>'''
root = ET.fromstring(xml)
print(root.find('.//Cloud_Coverage_Assessment').text)
output
90.7287
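The './/Cloud_Coverage_Assessment' search works without any namespace handling because that element carries no prefix and the document declares no default namespace. If you need to step through one of the n1: elements first, a namespace mapping is required. A minimal sketch, assuming the trimmed-down XML below stands in for the real file (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

xml_text = '''<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>'''

# Map the prefix used in the search path to the namespace URI from the document
ns = {'n1': 'https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd'}

root = ET.fromstring(xml_text)
quality = root.find('n1:Quality_Indicators_Info', ns)
# The child element has no prefix, so it is looked up by its plain name
cloud = float(quality.find('Cloud_Coverage_Assessment').text)
print(cloud)
```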

parsing and modifying an xml file with CDATA sections

I would like to programmatically modify some XML files but I end up adding some modifications inadvertently. For example consider the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<!-- A comment
-->
<abc:Tag xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:abc="http://www.mycompany.com" xmlns:def="http://www.anothercompany.com">
<abc:sometext oneattribute="Hello" anotherattribute="World">
Some random boring text.
</abc:sometext>
<def:somecode>
<![CDATA[
if a>=b:
print(a)
]]>
</def:somecode>
</abc:Tag>
I am trying to add a simple comment to the code included in the CDATA section. To do so I am using the following Python script, which manages to handle the namespaces correctly and add the string. However, the CDATA is lost in the output:
import sys
from lxml import etree as ET
xml_file = sys.argv[1]
tree = ET.parse(xml_file)
root = tree.getroot()
ns = {}
element_tree = ET.iterparse(xml_file, events=["start-ns"])
try:
    for event, (prefix, qualified_name) in element_tree:
        ET.register_namespace(prefix, qualified_name)
        ns[prefix] = qualified_name
except ET.ParseError as err:
    sys.exit(1)
for somecode in tree.findall('def:somecode', namespaces=ns):
    somecode.text = somecode.text + "# updated with a comment"
tree.write('output.xml',
           xml_declaration=True,
           encoding="UTF-8")
The resulting output is different than the input in two ways I didn't expect and don't know how to correct:
Single quotes are replaced by double
The code in CDATA is printed as normal text
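One possible way to keep the CDATA section, sketched here with the standard library's xml.dom.minidom rather than the lxml calls used in the question: minidom keeps CDATA sections as their own node type and writes them back out as CDATA. The XML below is a shortened stand-in for the original file, and this sketch does not address the quoting of the XML declaration. (With lxml, the usual route is a parser created with strip_cdata=False plus etree.CDATA when assigning the text; that variant is not shown here.)

```python
from xml.dom import minidom

xml_text = '''<abc:Tag xmlns:abc="http://www.mycompany.com" xmlns:def="http://www.anothercompany.com">
<def:somecode>
<![CDATA[
if a>=b:
    print(a)
]]>
</def:somecode>
</abc:Tag>'''

dom = minidom.parseString(xml_text)
for element in dom.getElementsByTagNameNS('http://www.anothercompany.com', 'somecode'):
    for node in element.childNodes:
        # Only touch the CDATA node, not the whitespace text nodes around it
        if node.nodeType == node.CDATA_SECTION_NODE:
            node.data += "# updated with a comment\n"

output = dom.documentElement.toxml()
print(output)
```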

xml minidom - get the full content of childnodes text

I have a Test.xml file as:
<?xml version="1.0" encoding="utf-8"?>
<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
<Dir3>D:\Users\Checkup</Dir3>
<Dir4>D:\Work1</Dir4>
<Dir5>E:\job1</Dir5>
</Src>
</LocSetup>
</SetupConf>
The number of "Dir" nodes depends on user input; there may be 1, 2, 5, or 10 directories defined. With the help of @Padraic Cunningham I am able to extract the data from Test.xml using the Python code below:
from xml.dom import minidom
from StringIO import StringIO
dom = minidom.parse('Test.xml')
Src = dom.getElementsByTagName('Src')
output = ", ".join([a.childNodes[0].nodeValue for node in Src for a in node.getElementsByTagName('Dir')])
print [output]
And getting the output:
C:\User1\test1, C:\User2\log, D:\Users\Checkup, D:\Work1, E:\job1
But the expected output is:
['C:\\User1\\test1', 'C:\\User2\\log', 'D:\\Users\\Checkup', 'D:\\Work1', 'E:\\job1']
I solved it myself:
from xml.dom import minidom
DOMTree = minidom.parse('Test0001.xml')
dom = DOMTree.documentElement
Src = dom.getElementsByTagName('Src')
for node in Src:
    output = [a.childNodes[0].nodeValue for a in node.getElementsByTagName('Dir')]
print output
And getting output:
[u'C:\User1\test1', u'C:\User2\log', u'D:\Users\Checkup', u'D:\Work1', u'E:\job1']
I am sure there is a simpler way. Please let me know. Thanks in advance.
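A possible simpler variant, written for Python 3 and assuming the numbered Dir1 to Dir5 tags shown in Test.xml: getElementsByTagName('Dir') only matches elements named exactly Dir, so this sketch walks the children of Src and keeps every element whose tag starts with Dir. The XML is inlined here so the example is self-contained.

```python
from xml.dom import minidom

xml_text = r"""<?xml version="1.0" encoding="utf-8"?>
<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
<Dir3>D:\Users\Checkup</Dir3>
<Dir4>D:\Work1</Dir4>
<Dir5>E:\job1</Dir5>
</Src>
</LocSetup>
</SetupConf>"""

dom = minidom.parseString(xml_text)
dirs = []
for src in dom.getElementsByTagName('Src'):
    for child in src.childNodes:
        # Skip whitespace text nodes; keep Dir1, Dir2, ... elements
        if child.nodeType == child.ELEMENT_NODE and child.tagName.startswith('Dir'):
            dirs.append(child.firstChild.nodeValue)
print(dirs)
```

Printing the list shows doubled backslashes because that is how Python displays a backslash inside a string's repr; each string still contains single backslashes.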

Adding attribute to child elements

I am trying to add an attribute to all child elements in all XML files in the current directory. This attribute should be equal to the length of each string. For example, the XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<RootElement>
<String Id="PythonLove">I love Python.</String>
</RootElement>
So, if this worked the way it should, it would leave the child opening tag looking like this:
<String Id="PythonLove" length="14">
I have read many forums and all point to either .set or .attrib to add attributes into an existing XML. Neither of these have any effect on the files though. My script currently looks like this:
for child in root:
    length_limit = len(child.text)
    child.set('length', length_limit)
I've also tried child.attrib['length'] = length_limit. This also doesn't work. What am I doing wrong?
Thanks
You need to convert the value to a string before calling set.
>>> xml = '''<?xml version="1.0" encoding="utf-8"?>
... <RootElement>
... <String Id="PythonLove">I love Python.</String>
... </RootElement>
... '''
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(xml)
>>> for child in root:
...     child.set('length', str(len(child.text))) # <---
...
>>> print(ET.tostring(root).decode())
<RootElement>
<String Id="PythonLove" length="14">I love Python.</String>
</RootElement>
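To see why the string conversion matters, here is a small sketch. It assumes current CPython behavior, where set() accepts the integer without complaint and the failure only shows up when the tree is serialized. The second String element and the `or ''` guard are additions for illustration, covering elements that have no text at all.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<RootElement>'
    '<String Id="PythonLove">I love Python.</String>'
    '<String Id="Empty"/>'
    '</RootElement>'
)

# Storing an int does not fail here, but serializing the tree does
root[0].set('length', len(root[0].text))
try:
    ET.tostring(root)
    serialized_ok = True
except TypeError:
    serialized_ok = False

# Fix: store strings, and treat a missing text node (None) as empty
for child in root:
    child.set('length', str(len(child.text or '')))

result = ET.tostring(root).decode()
print(serialized_ok)
print(result)
```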
Got it! Pretty elated because that was a couple weeks of struggles. I ended up just writing to 'infile' (used for iterating through the files in the cwd) and it worked to overwrite the existing XML (had to register the namespace first which was another little hump I ran into). Full code:
import glob
import os
import xml.etree.ElementTree as ET
# Register the default namespace once so it is written back without a prefix
ET.register_namespace('', "http://schemas.microsoft.com/wix/2006/XML")
path = os.getcwd()
for infile in glob.glob(os.path.join(path, '*.xml')):
    tree = ET.parse(infile)
    root = tree.getroot() # sets variable 'root' to the root element
    for child in root:
        string_length = str(len(child.text))
        child.set('length', string_length)
    tree.write(infile) # overwrites the existing XML file

python ElementTree the text of element who has a child

When I try to read the text of an element that has a child, it gives None:
See the xml (say test.xml):
<?xml version="1.0"?>
<data>
<test><ref>MemoryRegion</ref> abcd</test>
</data>
and the python code that wants to read 'abcd':
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
print(root.find("test").text)
When I run this python, it gives None, rather than abcd.
How can I read abcd under this condition?
Use Element.tail attribute:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test.xml')
>>> root = tree.getroot()
>>> print(root.find(".//ref").tail)
abcd
ElementTree has a rather different view of XML that is better suited to nested data: .text is the data right after a start tag, and .tail is the data right after an end tag. So you want:
print(root.find('test/ref').tail)
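The text/tail split can be checked with a short sketch. The XML is inlined instead of read from test.xml so the example is self-contained, and the variable names are illustrative. itertext() is shown as one way to get all the text of an element, children included.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<data><test><ref>MemoryRegion</ref> abcd</test></data>')
test = root.find('test')
ref = test.find('ref')

text_before_child = test.text   # nothing sits between <test> and <ref>
child_text = ref.text           # text inside <ref>
text_after_child = ref.tail     # text after </ref>, still inside <test>
full_text = ''.join(test.itertext())  # all text under <test>, in document order

print(repr(text_before_child), repr(child_text), repr(text_after_child), repr(full_text))
```

Note that the tail keeps its leading space, so call .strip() on it if only abcd is wanted.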
