Strange output from Python 3 ElementTree

I'm parsing a really simple .xml file with this snippet
import xml.etree.ElementTree as etree
tree = etree.parse('/home/user/dummy.xml')
print(tree.getroot())
the output is
<Element 'doc' at 0x1d2f090>
which is correct, but I was expecting something cleaner and as simple as
doc
Is this the normal output? How can I clean it up?
I'm using Python 3.x.
The dummy.xml file:
<?xml version="1.0"?>
<doc>
<branch name="testing" hash="1cdf045c">
text,source
</branch>
<branch name="release01" hash="f200013e">
<sub-branch name="subrelease01">
xml,sgml
</sub-branch>
</branch>
<branch name="invalid">
</branch>
</doc>

Yes, that's the default output for an Element. If you want just the tag, try:
print(tree.getroot().tag)
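As a rough illustration of the difference between the default repr and the individual fields of an Element, here is a small sketch. It parses a shortened inline copy of dummy.xml (an assumption made so the snippet runs without the file) and reads the tag, an attribute, and the text separately.

```python
import xml.etree.ElementTree as ET

# Shortened inline stand-in for dummy.xml, so no file on disk is needed.
xml_text = """<doc>
<branch name="testing" hash="1cdf045c">text,source</branch>
<branch name="invalid"></branch>
</doc>"""

root = ET.fromstring(xml_text)

# print(root) would show the repr, e.g. <Element 'doc' at 0x...>.
# The readable pieces live in separate attributes of the Element.
print(root.tag)

# Attributes are read with .get(); direct children are found by tag name.
branch_names = [branch.get("name") for branch in root.findall("branch")]
print(branch_names)

# .text holds the character data directly inside an element.
first_text = root.find("branch").text
print(first_text)
```

If the goal is to see the whole element as markup again, `ET.tostring(root, encoding="unicode")` returns it as a string.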

Related

How can I retrieve specific information from a XML file using python?

I am working with Sentinel-2 Images, and I want to retrieve the Cloud_Coverage_Assessment from the XML file. I need to do this with Python.
Does anyone have any idea how to do this? I think I have to use xml.etree.ElementTree, but I'm not sure how.
The XML file:
<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
...
</n1:General_Info>
<n1:Geometric_Info>
...
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
...
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
...
</Technical_Quality_Assessment>
<Quality_Control_Checks>
...
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>
Read the XML from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('sentinel2.xml')
root = tree.getroot()
print(root.find('.//Cloud_Coverage_Assessment').text)
..and I want to retrieve the Cloud_Coverage_Assessment
Try the code below (it uses an XPath expression):
import xml.etree.ElementTree as ET
xml = '''<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
</n1:General_Info>
<n1:Geometric_Info>
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
</Technical_Quality_Assessment>
<Quality_Control_Checks>
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>'''
root = ET.fromstring(xml)
print(root.find('.//Cloud_Coverage_Assessment').text)
output
90.7287
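A slightly more defensive variant, offered as a sketch rather than the one right way: it spells out the namespaced parent through a prefix map, returns None when the element is missing, and converts the value to a float. The helper name `cloud_coverage` and the trimmed inline XML are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix map for the n1 namespace declared on the root element.
NS = {"n1": "https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd"}

# Trimmed stand-in for the Sentinel-2 metadata file.
xml_text = """<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>"""


def cloud_coverage(root):
    """Return the cloud coverage as a float, or None if the element is absent."""
    # The parent carries the n1 prefix; the child has no namespace at all.
    node = root.find("n1:Quality_Indicators_Info/Cloud_Coverage_Assessment", NS)
    if node is None or node.text is None:
        return None
    return float(node.text)


root = ET.fromstring(xml_text)
coverage = cloud_coverage(root)
print(coverage)
```

With a real product file, `ET.parse(path).getroot()` can be passed to the same helper.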

How to print CData content while parsing xml local file with python?

I have an XML file in a local folder that I want to scrape using Python. It contains CDATA sections and looks like this:
<?xml version="1.0" encoding="utf-8"?>
<trial xmlns="urn::trial">
<drksId><![CDATA[DRKS00000024]]></drksId>
<firstDrksPublishDate><![CDATA[2008-09-05T12:36:54.000+02:00]]></firstDrksPublishDate>
<firstPartnerPublishDate><![CDATA[2004-01-15T00:00:00.000+01:00]]></firstPartnerPublishDate>
......
I tried:
import xml.etree.ElementTree as Et
tree=Et.parse(filename)
root=tree.getroot()
print(root.find('drksId').text)
Output:
I am getting root.find('drksId') as None. Thanks in advance
Search for the element with its namespace taken into account:
ns = {'ns': 'urn::trial'}
drksId = root.find('./ns:drksId', ns)
print(drksId.text)
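To make the effect visible, here is a minimal sketch on a closed, shortened copy of the document (the truncation is an assumption, since the original file is elided). It shows the lookup without a namespace failing, and two equivalent ways of including the namespace. The CDATA wrapper needs no special handling, because the parser returns its content as ordinary text.

```python
import xml.etree.ElementTree as ET

# Shortened, closed stand-in for the trial file.
xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<trial xmlns="urn::trial">
<drksId><![CDATA[DRKS00000024]]></drksId>
</trial>"""

root = ET.fromstring(xml_bytes)

# Without the namespace, nothing matches and find() returns None.
without_ns = root.find("drksId")

# Option 1: a prefix map (the prefix name "ns" is arbitrary).
with_prefix = root.find("ns:drksId", {"ns": "urn::trial"})

# Option 2: Clark notation, with the namespace URI in braces.
with_clark = root.find("{urn::trial}drksId")

print(without_ns)
print(with_prefix.text)
```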

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program to work on an XML file and change it. But when I try to access any part of it, I get an extra prefix.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that URL, and how do I stop it from showing up?
You have defined a default namespace (Wikipedia, lxml tutorial). When defined, it is a part of every child tag.
If you want to print the tag without the namespace, it's easy:
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
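The same stripping can be wrapped in a small helper. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages (that swap is an assumption of this example); both libraries report namespaced tags in the same `{uri}name` form. The function name `local_name` is made up for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>"""


def local_name(tag):
    """Strip a leading '{namespace}' from a tag; tags without one pass through."""
    return tag.rpartition("}")[2]


root = ET.fromstring(xml_text)

qualified = root[0][0].tag          # includes the namespace in braces
plain = local_name(qualified)       # just the element name
print(qualified)
print(plain)
```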

parsing XML file in python2.7

I know this is a very common question, but the XML file and the kind of data extraction I need are a little unusual. I would appreciate any help on the steps to extract the required data with Python 2.7.
I have the XML below:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>Mango.XYZ_DIG_Team_ABCDEF_Mango_Review</members>
<members>Mango.XYZ_DIG_Team_Reporting_Mango_Review</members>
<members>Opportunity.A_T_Occupier_City_Job_List</members>
<name>ListView</name>
</types>
<types>
<members>Modify_All_Data_Permission</members>
<members>Opportunity_Alerts_Implementation</members>
<members>Process_Builder_Permission</members>
<members>Regional_Business_Support</members>
<members>Reports_Dashboards_Data_Export_for_Super_Users</members>
<name>PermissionSet</name>
</types>
<types>
<members>SolutionManager</members>
<members>Standard</members>
<name>Profile</name>
</types>
<types>
<members>Mango.Set Verified Date and System Id</members>
<members>Mango.Update Mango Site With Billing Street%2C City%2C Country</members>
<members>Mango.Update Family Id on Mango when created</members>
<members>Opportunity.Set Opportunity Name</members>
<name>WorkflowRule</name>
</types>
<version>38.0</version>
</Package>
I am trying to extract only the members from the PermissionSet block, so that eventually I will have a file that only has entries like
Modify_All_Data_Permission
Opportunity_Alerts_Implementation
Process_Builder_Permission
Regional_Business_Support
Reports_Dashboards_Data_Export_for_Super_Users
I have been able to extract only the 'name' tag with:
from xml.dom import minidom
doc = minidom.parse("path_to_xmlFile")
t = doc.getElementsByTagName("types")
for n in t:
    name = n.getElementsByTagName("name")[0]
    print name.firstChild.data
How can I extract the members and save them to a file?
Note: the number of 'members' is not fixed; it varies.
I can also try a different library, if it serves the purpose.
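Probably easiest to use XPath. The elements live in the document's default namespace, so the path has to include it:
import xml.etree.ElementTree as ET
# The prefix 'md' is arbitrary; the URI is the xmlns value of <Package>.
ns = {'md': 'http://soap.sforce.com/2006/04/metadata'}
root = ET.parse('file.xml').getroot()
# Prints the members of every <types> block, not only PermissionSet.
for member in root.findall(".//md:members", ns):
    print(member.text)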
This may help you!
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
# root[1] is the second <types> block (PermissionSet); its <name> text is printed too.
for data in root[1]:
    print data.text
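Neither snippet selects the block by its name, so here is a sketch that picks the `<types>` block whose `<name>` is PermissionSet and collects only its members. It is written for Python 3 with the standard library, uses a shortened inline copy of the XML, and the helper name `members_of` and prefix `md` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix map for the default namespace declared on <Package>.
NS = {"md": "http://soap.sforce.com/2006/04/metadata"}

# Shortened stand-in for the package file.
xml_text = """<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>SolutionManager</members>
<members>Standard</members>
<name>Profile</name>
</types>
<types>
<members>Modify_All_Data_Permission</members>
<members>Process_Builder_Permission</members>
<name>PermissionSet</name>
</types>
<version>38.0</version>
</Package>"""


def members_of(root, type_name):
    """Collect the <members> text of every <types> block named type_name."""
    result = []
    for types in root.findall("md:types", NS):
        # Compare the block's <name> text instead of relying on its position.
        if types.findtext("md:name", namespaces=NS) == type_name:
            result.extend(m.text for m in types.findall("md:members", NS))
    return result


root = ET.fromstring(xml_text)
permission_members = members_of(root, "PermissionSet")
print("\n".join(permission_members))
```

To save the result, the joined string can be written with an ordinary `open(path, "w")` call; because the members are gathered in a loop, a varying number of entries is handled automatically.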

Adding processing instruction to xml before root element with cElementTree

I am using cElementTree library to produce xml files. Now I want to write .xsl file for better readability. That's why I need to add <?xml-stylesheet type="text/xsl" href="style.xsl"?> before first tag. Unfortunately I was able to put desired line only after first tag:
import xml.etree.cElementTree as Et
test_report = Et.Element("TEST_REPORT")
root = test_report
root.append(Et.ProcessingInstruction('xml-stylesheet', 'type="text/xsl" href="style.xsl"'))
...
...
tree = Et.ElementTree(root)
tree.write(self.file_name+"_result.xml")
Which logically produces:
<TEST_REPORT>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
...
...
</TEST_REPORT>
What I need is:
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<TEST_REPORT>
...
...
</TEST_REPORT>
I am looking for something like this but it seems like there is no addprevious method in cElementTree.
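One possible workaround, sketched here under a few assumptions: write the XML declaration and the processing instruction to the output yourself, then let ElementTree serialize the root element after them. The example uses xml.etree.ElementTree (the cElementTree alias was removed in Python 3.9), an in-memory BytesIO buffer in place of the real output file, and a made-up CASE child element.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("TEST_REPORT")
ET.SubElement(root, "CASE").text = "ok"   # hypothetical child, for illustration

# BytesIO stands in for a file opened in binary mode, e.g. open(path, "wb").
buf = io.BytesIO()

# Write the prolog by hand: declaration first, then the stylesheet instruction.
buf.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
buf.write(b'<?xml-stylesheet type="text/xsl" href="style.xsl"?>\n')

# Serialize the tree after the prolog; suppress the automatic declaration
# so that it does not appear a second time.
ET.ElementTree(root).write(buf, encoding="utf-8", xml_declaration=False)

result = buf.getvalue().decode("utf-8")
print(result)
```

With lxml, `addprevious` on the root element is the more direct route; this sketch only covers the standard library.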
